
I am currently trying to filter my data.frame down to rows with unique values in two of my columns (Gene and UMI). I am using the distinct function from the dplyr package to do this. When my data set is small the code runs quickly; however, when the data.frame is very large (100 million rows or so) the program takes what seems like forever to run. Is there a more efficient way to solve this problem?

Here is what I am currently doing (this is just a snippet from a larger program):

library(dplyr)

df <- read.delim("hash_test.txt")
df <- arrange(df, Gene)
filter_umis <- df %>% distinct(Gene, UMI)

Here is some sample data I used to test. The actual data is much larger:

LN.Tfr.1    LN.Tfr.2    LN.Tfr.3    Gene    UMI
27.129      25.324      19.49333333 Tubgcp6 GCCC
8.887       8.886       5.924333333 Tubgcp6 GCCC
4.21        14.661      9.017       Uba52   GTTT
40.693      12.884      22.59466667 Ube2d2  GCAC
1.871       2.221       1.364       Ube2d3  GCAG
  • Should `genes` be `Gene`? – steveb Apr 02 '16 at 22:33
  • Also, have you considered [fread](http://stackoverflow.com/a/15058684/3576984)? (A sketch combining these suggestions follows the comments.) – MichaelChirico Apr 02 '16 at 22:43
  • @MichaelChirico I think that did the trick I just ran it on a medium size data set and it was twice as fast. I am going to run it overnight on the larger dataset and will get back to you. Thank you! – Alos Apr 02 '16 at 23:31
  • @Alos It's probably an order of magnitude faster once you've sorted the data set using `setkey`. – lukeA Apr 02 '16 at 23:46
  • How is the data stored? SQL? – alistaire Apr 03 '16 at 05:13
  • @alistaire The data is just stored on the server. We don't need to keep it for long. Once we get results from the sequencing facility, we process people's data and send them the results. They are not kept by us. Thanks – Alos Apr 03 '16 at 20:14
  • @MichaelChirico can you post your solution as an answer so that I can accept it? It has made a big impact in the runtime of the program. Thank you! – Alos Apr 03 '16 at 20:58
  • You should provide a reproducible example. – jangorecki Apr 03 '16 at 22:52
  • @lukeA Setting a key requires 1. computing the order, and 2. reordering the data in RAM. Here, reordering is unnecessary and should be avoided; i.e., using `unique()` directly should be more performant. Keys are advantageous for repetitive operations (as explained in the new [Secondary indices and auto indexing](https://github.com/Rdatatable/data.table/wiki/Getting-started) vignette). – Arun Apr 03 '16 at 22:53
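
Putting the comment suggestions together, a minimal sketch might look like the following: it swaps read.delim for data.table's fread and de-duplicates with unique() directly (no setkey), assuming the same tab-delimited file and the Gene/UMI columns from the sample data above.

library(data.table)

# fread is typically much faster than read.delim for large files
df <- fread("hash_test.txt")
# de-duplicate on the two key columns directly, without sorting first
out <- unique(df, by = c("Gene", "UMI"))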

1 Answer


`unique()` in data.table is a likely candidate for a speedier alternative.

library(data.table)
setDT(df) # converts the data.frame to a data.table *without* copying
out <- unique(df, by = c("Gene", "UMI"))
# alternatively, in one go:
out <- unique(setDT(df), by = c("Gene", "UMI"))

If you would like to select only some columns, using duplicated() as shown below is slightly more efficient:

setDT(df)
# select_cols is a character vector of the column names to keep in the result
out <- df[!duplicated(df, by = c("Gene", "UMI")), select_cols, with = FALSE]

This is because data.table sees both the rows we'd like to subset (the first argument, i) and the columns we want to select (the second argument, j) up front, and therefore doesn't materialise any other columns unnecessarily.
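
For example, with the sample data above, one hypothetical choice of select_cols (purely illustrative, not part of the original answer) could be:

setDT(df)
select_cols <- c("LN.Tfr.1", "Gene", "UMI")  # hypothetical selection, for illustration only
out <- df[!duplicated(df, by = c("Gene", "UMI")), select_cols, with = FALSE]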

See here for more on data.table.
