
I have a dataframe of BLAST HSPs (not all columns displayed):

      query.id subject.id alignment.length
196 1032519524 1032519523              212
197 1032519524 1032519523              182
198 1032519524 1032519522              212
199 1032519524 1032519522              182
200 1032519524 1032518642              212

and would like to collapse the data so I end up with unique pairs of query.id and subject.id. If there are multiple rows with the same query.id and subject.id, their alignment.length values should be summed:

    query.id subject.id alignment.length
1 1032519524 1032518642              212
2 1032519524 1032519522              394
3 1032519524 1032519523              394

I do this with a neat one-liner using plyr:

library(plyr)
ddply(blast.results, c("query.id", "subject.id"), function(x) colSums(x['alignment.length']))

Unfortunately, this becomes prohibitively slow when processing a few hundred thousand BLAST results. Is there a faster, more scalable approach?
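
For anyone reproducing the timings below, here is a minimal sketch of the example data (numeric IDs assumed; real BLAST tabular output has more columns):

blast.results <- data.frame(
    query.id         = c(1032519524, 1032519524, 1032519524, 1032519524, 1032519524),
    subject.id       = c(1032519523, 1032519523, 1032519522, 1032519522, 1032518642),
    alignment.length = c(212, 182, 212, 182, 212)
)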

Microbenchmark for @PoGibas's data.table solutions:

Unit: milliseconds
                                                                                                                        expr
                                 setDT(blast.results)[, .(alignment.length = sum(alignment.length)), .(query.id, subject.id)]
 setkey(setDT(blast.results), query.id, subject.id)[, .(alignment.length = sum(alignment.length)), .(query.id, subject.id)]
                                                                                                                         100
        min        lq        mean     median        uq        max neval cld
  11.514016 18.010048 31.61341262 22.0045935 32.104018 222.943783   100   b
  15.308905 22.620595 36.32531007 28.2132725 43.527390 156.109477   100   b
   0.000012  0.000185  0.00033057  0.0003635  0.000443   0.000772   100  a
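
A sketch of how such a benchmark could be run (assuming the data is in blast.results). Note that times is a named argument of microbenchmark(); a bare 100 passed positionally would be timed as an expression of its own, which would explain the nanosecond-scale third row above:

library(microbenchmark)
library(data.table)

# Note: setDT() and setkey() modify blast.results by reference,
# so the keyed and unkeyed timings are not fully independent here.
microbenchmark(
    setDT(blast.results)[, .(alignment.length = sum(alignment.length)),
                         .(query.id, subject.id)],
    setkey(setDT(blast.results), query.id, subject.id)[
        , .(alignment.length = sum(alignment.length)), .(query.id, subject.id)],
    times = 100
)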

1 Answer


Solution using dplyr (by @hadley):

library(dplyr)
blast.results %>%
    group_by(query.id, subject.id) %>%
    summarise(alignment.length = sum(alignment.length))

Solution using data.table (by @Matt Dowle):

library(data.table)
setkey(setDT(blast.results), query.id, subject.id)[, .(alignment.length = sum(alignment.length)), .(query.id, subject.id)]

Since you mention that speed is important, you probably want to use data.table (see data.table vs dplyr).
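
If the table is only aggregated once, the setkey() step is optional; a sketch of the unkeyed form (the first expression in the question's benchmark):

library(data.table)
setDT(blast.results)[, .(alignment.length = sum(alignment.length)), .(query.id, subject.id)]

setkey() sorts the table and marks the key columns, which pays off when the same grouping is reused across many operations; for a single pass, grouping in `by` alone is enough.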

  • agree that the speed thing is generally true, but I would suggest the OP check the speed on both of them and see whether the difference in this case is small enough that they can choose on the basis of comfort/familiarity/speed rather than brute computational efficiency ("a few hundreds of thousands" might not be "big" for the purposes of differentiating dplyr vs data.table) – Ben Bolker Sep 19 '17 at 23:01
  • Oh wow, data.table looks really fast! Benchmarking it at the moment with a larger dataset, cannot test the dplyr solution at the moment since it fails to install... – user1981275 Sep 19 '17 at 23:14
  • @user1981275 I updated my answer with `setkey` in `data.table`, try it. – pogibas Sep 19 '17 at 23:16
  • @PoGibas In my benchmark (question edited), the one with `setkey` is slightly slower in mean and min values. I don't include my ddply solution since it takes forever – user1981275 Sep 19 '17 at 23:28