
I have a dataframe with 2 columns: an identifier column (ID) and a column with names. Each identifier appears several times in the ID column (see below).

 ID           Names
uc001aag.1  DKFZp686C24272
uc001aag.1  DQ786314
uc001aag.1  uc001aag.1
uc001aah.2  AK056232
uc001aah.2  FLJ00038
uc001aah.2  uc001aah.1
uc001aah.2  uc001aah.2
uc001aai.1  AY217347
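
For a reproducible example, the input above can be reconstructed like this (the data frame name tmp is just an assumption here, chosen to match the answers below):

# sample data from the question; "tmp" is an illustrative name
tmp <- data.frame(
  ID    = c("uc001aag.1", "uc001aag.1", "uc001aag.1",
            "uc001aah.2", "uc001aah.2", "uc001aah.2", "uc001aah.2",
            "uc001aai.1"),
  Names = c("DKFZp686C24272", "DQ786314", "uc001aag.1",
            "AK056232", "FLJ00038", "uc001aah.1", "uc001aah.2",
            "AY217347"),
  stringsAsFactors = FALSE  # keep Names as character, not factor
)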

Now I want to create a dataframe like this:

 ID           Names
uc001aag.1  DKFZp686C24272 | DQ786314 | uc001aag.1
uc001aah.2  AK056232 | FLJ00038 | uc001aah.1 | uc001aah.2
uc001aai.1  AY217347

Can anyone help me?

Lisann

2 Answers


You can use aggregate:

R> aggregate(Names ~ ID, data=tmp, FUN=paste, collapse=" | ")
          ID                                         Names
1 uc001aag.1        DKFZp686C24272 | DQ786314 | uc001aag.1
2 uc001aah.2 AK056232 | FLJ00038 | uc001aah.1 | uc001aah.2
3 uc001aai.1                                      AY217347
rcs
  • @rcs, this method works fine, but I have a very large dataset. Is there a way to speed up the analysis? Thanks – Lisann May 10 '11 at 08:18
  • Maybe parallelization with `ddply` from the `plyr` package: `ddply(tmp, .(ID), function(x) paste(x$Names, collapse=" | "), .parallel=TRUE)` – rcs May 10 '11 at 08:31
  • That code from the plyr package gives me this error: Loading required package: foreach Error: foreach package required for parallel plyr operation In addition: Warning message: In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, : there is no package called 'foreach' – Lisann May 10 '11 at 08:42
  • The `foreach` package is required. Furthermore, you have to register a parallel backend (e.g. `doMC`, which is not available for Windows platforms); a minimal registration sketch follows these comments. See also here: http://stackoverflow.com/questions/5588914/domc-vs-dosnow-vs-dosmp-vs-dompi-why-arent-the-various-parallel-backends-for-f/5621388#5621388 – rcs May 10 '11 at 08:50
  • There was a LOCK00 file in the directory, so the foreach package could not be downloaded. I fixed this, but the analysis is still slow. The file contains 2 columns and 450000 rows. – Lisann May 10 '11 at 09:00
  • @Lisann `tapply(tmp$Names,tmp$ID,paste,collapse=" | ")` is a bit faster, but at this scale this is really a job for C code. – mbq May 10 '11 at 10:02
  • `tapply` can be modified to run in parallel using `snowfall`. – Roman Luštrik May 10 '11 at 10:52
  • Thanks for the suggestion, I'm gonna try it! =) – Lisann May 10 '11 at 10:58
  • I'm not sure how much any of the parallelizing tricks will help. The problem with ddply in this case (450,000 rows being merged into something like 200,000) is the combine step. The split step is probably not that expensive, and the paste shouldn't be too bad either. But ddply does an rbind repeatedly on the resulting little data.frames, which is a disaster because it has to re-optimize the string storage every time. (I think?) It may be better to use dlply, returning just the pasted string each time, then unlist() to collapse to a vector, and then paste back the ID names (see the sketch below). – Harlan May 10 '11 at 15:34
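
A minimal sketch of rcs's parallel ddply suggestion with a backend actually registered (assuming a non-Windows platform where doMC is available, and the question's data frame named tmp):

library(doMC)            # multicore backend for foreach
library(plyr)
registerDoMC(cores = 2)  # register 2 workers; required before .parallel=TRUE works

# rcs's call from the comments above
ddply(tmp, .(ID), function(x) paste(x$Names, collapse = " | "), .parallel = TRUE)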
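Following up on Harlan's last comment, a minimal sketch of the dlply variant (again assuming the data frame is named tmp):

library(plyr)
# dlply returns a plain list with one pasted string per ID, which avoids
# ddply's repeated rbind of small data frames during the combine step
res <- dlply(tmp, .(ID), function(x) paste(x$Names, collapse = " | "))
# collapse the list to a vector and paste the ID names back on
data.frame(ID = names(res), Names = unlist(res), row.names = NULL)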

aggregate is quite fast, but you can use an sapply-based solution to parallelize the code. This can easily be done on Windows using snowfall:

require(snowfall)
sfInit(parallel = TRUE, cpus = 2)  # start a local cluster with 2 CPUs
sfExport("Data")                   # copy the data frame to every worker

ID <- unique(Data$ID)
# each worker gets a chunk of IDs and subsets the exported Data for each one
CombNames <- sfSapply(ID, function(i){
    paste(Data$Names[Data$ID == i], collapse = " | ")
})
data.frame(ID, CombNames)
sfStop()                           # shut the cluster down again

The parallel version will give you an extra speedup, but the plain sapply solution is actually slower than aggregate. tapply is a bit faster, but can't be parallelized using snowfall. On my computer:

n <- 3000
m <- 3
Data <- data.frame( ID = rep(1:n,m),
                    Names=rep(LETTERS[1:m],each=n))
 # using snowfall for parallel sapply    
 system.time({
   ID <- unique(Data$ID)
   CombNames <- sfSapply(ID,function(i){
     paste(Data$Names[Data$ID==i],collapse=" | ")
   })
   data.frame(ID,CombNames)
 }) 
   user  system elapsed 
   0.02    0.00    0.33 

 # using tapply
 system.time({
   CombNames <- tapply(Data$Names,Data$ID,paste,collapse=" | ")
   data.frame(ID=names(CombNames),CombNames)
 })
   user  system elapsed 
   0.44    0.00    0.44 

 # using aggregate
 system.time(
   aggregate(Names ~ ID, data=Data, FUN=paste, collapse=" | ")
 )
   user  system elapsed 
   0.47    0.00    0.47 

 # using the normal sapply
 system.time({
   ID <- unique(Data$ID)
   CombNames <- sapply(ID,function(i){
     paste(Data$Names[Data$ID==i],collapse=" | ")
   })
   data.frame(ID,CombNames)
 })
   user  system elapsed 
   0.75    0.00    0.75 

Note:

For the record, a better sapply solution would be:

CombNames <- sapply(split(Data$Names,Data$ID),paste,collapse=" | ")
data.frame(ID=names(CombNames),CombNames)

which is equivalent to tapply. But parallelizing this one is actually slower, as you have to move more data around within the sfSapply call: the slowdown comes from copying the dataset to every CPU. This is what you have to keep in mind when your dataset is huge: you pay for the speed with more memory usage.
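
To illustrate that last point, a sketch of what the parallelized split version would look like (hypothetical; it assumes the snowfall cluster from above is still running):

# each element of "chunks" holds the Names for one ID; sfSapply ships
# these chunks, i.e. the whole Names column, out to the workers,
# which is where the extra data movement comes from
chunks <- split(Data$Names, Data$ID)
CombNames <- sfSapply(chunks, paste, collapse = " | ")
data.frame(ID = names(CombNames), CombNames)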

Joris Meys