0

I have a very large dataset that looks like this:

y <- structure(list(external_gene_name = structure(c(1L, 1L, 1L, 6L, 
      6L, 4L, 3L, 5L, 5L, 2L), .Label = c("FAM87B", "ISG15", "KLHL17", 
      "NOC2L", "PLEKHN1", "SAMD11"), class = "factor"), shortestDistance = c(99L, 
      2L, 1552L, 885L, 1677L, 2160L, 882L, 421L, 497L, 1246L), A = c(8.388, 
      9.913, 22.876, 19.826, 25.163, 122.766, 122.766, 40.414, 16.013, 
      155.554), B = c(3.403, 0.851, 0.851, 33.179, 26.373, 80.821, 
      80.821, 8.507, 17.015, 165.045), C = c(0.541, 1.622, 11.892, 
      2.162, 3.243, 17.298, 17.298, 0.541, 1.081, 33.515)), .Names = c("external_gene_name", 
      "shortestDistance", "A", "B", "C"), row.names = c(5L, 7L, 8L, 
      19L, 20L, 21L, 22L, 23L, 25L, 31L), class = "data.frame")


# external_gene_name shortestDistance       A       B      C
# 5              FAM87B               99   8.388   3.403  0.541
# 7              FAM87B                2   9.913   0.851  1.622
# 8              FAM87B             1552  22.876   0.851 11.892
# 19             SAMD11              885  19.826  33.179  2.162
# 20             SAMD11             1677  25.163  26.373  3.243
# 21              NOC2L             2160 122.766  80.821 17.298
# 22             KLHL17              882 122.766  80.821 17.298
# 23            PLEKHN1              421  40.414   8.507  0.541
# 25            PLEKHN1              497  16.013  17.015  1.081
# 31              ISG15             1246 155.554 165.045 33.515

I need to subset this data so that for every factor level of external_gene_name, only the entry with the smallest shortestDistance is kept. I have found this implementation:

library(plyr)
ddply(y, .(external_gene_name), summarise, shortestDistance=min(shortestDistance))
# external_gene_name shortestDistance
# 1             FAM87B                2
# 2              ISG15             1246
# 3             KLHL17              882
# 4              NOC2L             2160
# 5            PLEKHN1              421
# 6             SAMD11              885

However this only keeps the grouping factor column, and the summarizing column. I need to keep all columns from the original data set. To accomplish this, I have also figured out this implementation:

do.call(rbind, by(y, y$external_gene_name, function(z) z[which.min(z$shortestDistance), ] ))
# external_gene_name shortestDistance       A       B      C
# FAM87B              FAM87B                2   9.913   0.851  1.622
# ISG15                ISG15             1246 155.554 165.045 33.515
# KLHL17              KLHL17              882 122.766  80.821 17.298
# NOC2L                NOC2L             2160 122.766  80.821 17.298
# PLEKHN1            PLEKHN1              421  40.414   8.507  0.541
# SAMD11              SAMD11              885  19.826  33.179  2.162

This returns the data with the original columns intact. However, both of these implementations are incredibly slow. My full data set includes a variable number of sample (e.g. A through C here, 50+), and ~15,000 external_gene_name factor levels with 1-20+ observations each. It takes upwards of 15 minutes to process a single dataset this way, and I have 500+ of these data sets to process. Is there a better way to accomplish this? I have seen data.table implementations of similar use cases which were purportedly faster, but I have not been able to figure out how to implement them.

user5359531
  • 3,217
  • 6
  • 30
  • 55
  • 1
    Might by something useful here - http://stackoverflow.com/questions/16227223/aggregate-by-factor-levels-keeping-other-variables-in-the-resulting-data-frame/16229636 – thelatemail Jul 13 '16 at 00:52

3 Answers3

2

Here is one of the data.table way you can try with:

library(data.table)
setDT(y)
y[, .SD[shortestDistance == min(shortestDistance)], .(external_gene_name)]
#    external_gene_name shortestDistance       A       B      C
# 1:             FAM87B                2   9.913   0.851  1.622
# 2:             SAMD11              885  19.826  33.179  2.162
# 3:              NOC2L             2160 122.766  80.821 17.298
# 4:             KLHL17              882 122.766  80.821 17.298
# 5:            PLEKHN1              421  40.414   8.507  0.541
# 6:              ISG15             1246 155.554 165.045 33.515

Another option that might be faster is:

setkey(y, external_gene_name, shortestDistance)
y[, head(.SD, 1), .(external_gene_name)]
#    external_gene_name shortestDistance       A       B      C
# 1:             FAM87B                2   9.913   0.851  1.622
# 2:              ISG15             1246 155.554 165.045 33.515
# 3:             KLHL17              882 122.766  80.821 17.298
# 4:              NOC2L             2160 122.766  80.821 17.298
# 5:            PLEKHN1              421  40.414   8.507  0.541
# 6:             SAMD11              885  19.826  33.179  2.162
Psidom
  • 209,562
  • 33
  • 339
  • 356
  • 3
    Or if you only want one row in the case of duplicates - `y[, .SD[which.min(shortestDistance)], by=external_gene_name]` – thelatemail Jul 13 '16 at 00:54
  • @thelatemail Definitely more succinct and likely more efficient also. – Psidom Jul 13 '16 at 00:56
  • 2
    Arguable, I actually prefer your answer most of the time, as duplicate records are the bane of my existence these days :-) – thelatemail Jul 13 '16 at 00:58
  • Thanks. Is there a formal name for the `.SD` argument here, and for the `.()` syntax? I have been having a hard time finding an explanation of them on Google. – user5359531 Jul 13 '16 at 01:01
  • 1
    `.(...)` is equivalent to `list(...)` and `.SD` stands for selected columns if `.SDcols` is provided. It will be all of the columns by default if `.SDcols` parameter is not specified. – Psidom Jul 13 '16 at 01:03
  • It doesn't seem like I can explain better. Apparently I am not an expert with `data.table` either. – Psidom Jul 13 '16 at 01:08
  • 2
    Wow, I ran this command on my full dataset, which had taken 15+ minutes to complete with the methods I posted. This method took less than 30 seconds. – user5359531 Jul 13 '16 at 01:10
  • 3
    If you already sorting the data (in your second option), just do `unique(y, by = "external_gene_name")`. – David Arenburg Jul 13 '16 at 03:25
1

Here's a dplyr solution,

library(dplyr)
y %>%
  group_by(external_gene_name) %>%
  do(.[which.min(.$shortestDistance), ])
Ben Fasoli
  • 526
  • 3
  • 7
1

We can order by 'shortestDistanceafter grouping by 'external_gene_name' and useunique` to get the first row for each group.

library(data.table)
unique(setDT(y)[order(external_gene_name, shortestDistance)], by = "external_gene_name")
#   external_gene_name shortestDistance       A       B      C
#1:             FAM87B                2   9.913   0.851  1.622
#2:              ISG15             1246 155.554 165.045 33.515
#3:             KLHL17              882 122.766  80.821 17.298
#4:              NOC2L             2160 122.766  80.821 17.298
#5:            PLEKHN1              421  40.414   8.507  0.541
#6:             SAMD11              885  19.826  33.179  2.162

Or another option is .I

setDT(y)[y[, .I[shortestDistance == min(shortestDistance)], .(external_gene_name)]$V1]

Or using dplyr

library(dplyr)
y %>%
  group_by(external_gene_name) %>%
 filter(shortestDistance == min(shortestDistance)

Or

y %>%
  group_by(external_gene_name) %>%
  slice(which.min(shortestDistance))

Or a base R option would be

y[!duplicated(y[with(y, order(external_gene_name, shortestDistance)),1]),]
akrun
  • 874,273
  • 37
  • 540
  • 662