Let's say I have a file with two columns labeled A and B. Each column consists of different strings, with repetition allowed. The A column is already sorted. Here is an example:

A       B
c1045   GO:0003735
c1045   GO:0005829
c1045   GO:0005840
c1045   GO:0006412
c1045   GO:0019843
c11467  GO:0003735
c11467  GO:0005840
c11467  GO:0006412
c1168   GO:0006950
c1168   GO:0006950
c1175   GO:0003674
c1175   GO:0003729
c1175   GO:0003735
c1175   GO:0006412

I want to create a new file in which each string in the A column appears only once, with all of its corresponding B strings concatenated into a single comma-separated value.

The resulting file will begin with:

A       B
c1045   GO:0003735,GO:0005829,GO:0005840,GO:0006412,GO:0019843.
c11467  GO:0003735,GO:0005840,GO:0006412.

Is there an easy way to do this in R?

bela83

1 Answer

Is this what you are looking for?

library(data.table)
dt <- data.table(df)  # df is built in the "Data:" block below

# For each value of A, join the values of every other column with commas
dt[, lapply(.SD, function(x) paste0(x, collapse = ",")), by = A]
#>         A                                                      B
#> 1:  c1045 GO:0003735,GO:0005829,GO:0005840,GO:0006412,GO:0019843
#> 2: c11467                       GO:0003735,GO:0005840,GO:0006412
#> 3:  c1168                                  GO:0006950,GO:0006950
#> 4:  c1175            GO:0003674,GO:0003729,GO:0003735,GO:0006412
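
If you would rather not depend on data.table, the same grouping can probably be done in base R. This is only a minimal sketch, not part of the approach above: it assumes a data frame with character columns named A and B, and the small `df` built here is a made-up subset of the question's example so that the snippet runs on its own.

```r
# Toy data mirroring the question's layout (subset of the first two groups).
df <- data.frame(
  A = c("c1045", "c1045", "c11467", "c11467", "c11467"),
  B = c("GO:0003735", "GO:0005829", "GO:0003735", "GO:0005840", "GO:0006412"),
  stringsAsFactors = FALSE
)

# aggregate() splits B by the values of A and applies paste() to each piece;
# the extra argument collapse = "," is passed through to paste().
collapsed <- aggregate(B ~ A, data = df, FUN = paste, collapse = ",")
print(collapsed)
```

`aggregate()` returns an ordinary data frame with one row per distinct value of A, which may be convenient if the rest of your script does not use data.table.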

Data:

df <- read.table(
  text="A       B
c1045   GO:0003735
c1045   GO:0005829
c1045   GO:0005840
c1045   GO:0006412
c1045   GO:0019843
c11467  GO:0003735
c11467  GO:0005840
c11467  GO:0006412
c1168   GO:0006950
c1168   GO:0006950
c1175   GO:0003674
c1175   GO:0003729
c1175   GO:0003735
c1175   GO:0006412",
  header=TRUE,
  stringsAsFactors=FALSE)
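
Two things the example data and the question hint at: c1168 lists GO:0006950 twice, and the goal is a new file. The sketch below is one possible way to handle both in base R. It assumes you want repeated B values dropped within a group (skip `unique()` if you do not) and a tab-separated output; the helper `collapse_unique` and the `tempfile()` path are placeholders made up for illustration.

```r
# Toy data: one group with a repeated B value, one without.
df <- data.frame(
  A = c("c1168", "c1168", "c1175", "c1175"),
  B = c("GO:0006950", "GO:0006950", "GO:0003674", "GO:0003729"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop repeated values, then join the rest with commas.
collapse_unique <- function(x) paste(unique(x), collapse = ",")
result <- aggregate(B ~ A, data = df, FUN = collapse_unique)

# Write the collapsed table as tab-separated text.
# tempfile() is a placeholder; replace it with the path you actually want.
out_path <- tempfile(fileext = ".txt")
write.table(result, out_path, sep = "\t", quote = FALSE, row.names = FALSE)
```

With `quote = FALSE` and `row.names = FALSE`, the file contains just the header line and one line per value of A.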
nrussell