
I've run into a bit of trouble with my code and would welcome any suggestions to make it run faster. I have a data frame that looks like this:

    Name <- c("a","a","a","a","a","b","b","b","b","c")
    Category <- c("sun","cat","sun","sun","sea","sun","sea","cat","dog","cat")
    More_info <- c("table","table","table","table","table","table","table","table","table","cat")
    d <- data.frame(Name, Category, More_info)

So I have duplicated entries for each value in the Name column (the number of duplicates can vary). For each entry (a, b, ...) I want to count how often each corresponding element appears in the Category column and keep only the category that appears most often. If an entry's categories are tied in number, I want to take one of the most frequent categories at random. So in this case, the output data frame would look like this:

    Name <- c("a","b","c")
    Category <- c("sun","dog","cat")
    More_info <- c("table","table","cat")
    d <- data.frame(Name, Category, More_info)

a has its sun entry kept because it appears the most, b could end up with dog or any of its other values since they each appear once, and c wouldn't be changed. My function looks like this:

    my_choosing_function <- function(x){
        # rows of the real data for one SNP id
        tmp <- dbSNP_hapmap[dbSNP_hapmap$refsnp_id == list_of_snps[x], ]
        # frequency table of its consequence types
        snp_freq <- as.data.frame(table(tmp$consequence_type_tv))
        # order by descending frequency and keep the top row
        best_hit <- snp_freq[order(-snp_freq$Freq), ]
        best_hit$SNP <- list_of_snps[x]
        top <- best_hit[1, ]
        return(top)
    }
    trst <- lapply(seq_along(list_of_snps), my_choosing_function)
    final <- do.call("rbind", trst)

Here I start from a list of unique elements (that would be Name in our case); for each element I build a table of the duplicated entries, order the table by descending frequency, and keep the top element. I lapply over each element of the list of unique values, then rbind the whole thing.
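In terms of the example above, the same approach would look roughly like this (a sketch with the example's column names swapped in, since dbSNP_hapmap and list_of_snps come from my real data; like my function, it keeps the first category on ties rather than picking at random):

    # hypothetical adaptation of my_choosing_function to the example d
    unique_names <- unique(d$Name)
    choose_for_name <- function(x){
        # all rows for one Name
        tmp <- d[d$Name == unique_names[x], ]
        # frequency table of its categories
        cat_freq <- as.data.frame(table(tmp$Category))
        # order by descending frequency and keep the top row
        best_hit <- cat_freq[order(-cat_freq$Freq), ]
        best_hit$Name <- unique_names[x]
        best_hit[1, ]
    }
    final <- do.call("rbind", lapply(seq_along(unique_names), choose_for_name))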

As I have 2,500,000 rows in my initial data frame and 1,500,000 unique elements, it takes forever to run: at 4 seconds per 100 lines, that would be a total of 34 hours for the lapply.

I'm sure packages like dplyr could do it in a few minutes, but I can't find a solution. Does anyone have an idea? Thanks a lot for your help!

– Vbokito

3 Answers


Note: this should really just be a very long comment, because I use data.table instead of dplyr.

I suggest using data.table because it runs faster. And the data.table approach shown below randomly chooses one category in case of a tie, rather than always taking the first one.

    library(data.table)
    library(dplyr)

    # simulate data at the scale described in the question
    d <- data.frame(
        Name = as.character(sample.int(10000, 2.5e6, replace = TRUE)),
        Category = as.character(sample.int(10000, 2.5e6, replace = TRUE)),
        More_info = rep('table', 2.5e6)
    )

    Mode <- function(x) {
        ux <- unique(x)
        # frequency of each unique value
        fr1 <- tabulate(match(x, ux))
        # if every value is equally frequent, pick one at random;
        # otherwise take the (first) most frequent value
        if (n_distinct(fr1) == 1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
    }

    system.time({
        d %>%
            group_by(Name) %>%
            slice(which(Category == Mode(Category))[1])
    })
    #    user  system elapsed
    #  45.932   0.808  46.745

    system.time({
        dt <- as.data.table(d)
        # count occurrences of each (Name, Category) pair
        dt.max <- dt[, .N, by = .(Name, Category)]
        # rank the counts within each Name, breaking ties at random
        dt.max[, r := frank(-N, ties.method = 'random'), by = .(Name)]
        dt.max <- dt.max[r == 1, .(Name, Category)]

        # keep the first matching row of the original data for each winner
        dt[dt.max, on = .(Name, Category), mult = 'first']
    })
    #    user  system elapsed
    #   2.424   0.004   2.426
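
For illustration, here are the same steps on the toy d from the question (a small sketch; because of the random rank, the category kept for b can differ between runs):

    toy <- data.table(
        Name = c("a","a","a","a","a","b","b","b","b","c"),
        Category = c("sun","cat","sun","sun","sea","sun","sea","cat","dog","cat"),
        More_info = c("table","table","table","table","table","table","table","table","table","cat")
    )
    toy.max <- toy[, .N, by = .(Name, Category)]
    toy.max[, r := frank(-N, ties.method = 'random'), by = .(Name)]
    toy[toy.max[r == 1, .(Name, Category)], on = .(Name, Category), mult = 'first']
    #    Name Category More_info
    # 1:    a      sun     table
    # 2:    b      dog     table   (one possible pick; all of b's categories are tied)
    # 3:    c      cat       cat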
– mt1022
  • This solution was insanely fast: I processed the whole 2.5M lines in under 4 minutes instead of 40 h with my initial function! Thank you very much. – Vbokito Dec 07 '17 at 05:41

We can modify the Mode function from here and then slice within each group:

    library(dplyr)

    Mode <- function(x) {
        ux <- unique(x)
        fr1 <- tabulate(match(x, ux))
        if (n_distinct(fr1) == 1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
    }

    d %>%
        group_by(Name) %>%
        slice(which(Category == Mode(Category))[1])
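
On the question's example data this returns one row per Name; the row for b varies between runs because all of its categories are tied. One possible printed result (a sketch):

    #   Name  Category More_info
    # 1 a     sun      table
    # 2 b     dog      table
    # 3 c     cat      cat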
– akrun
  • Does this run faster than the one from the OP? I think some benchmarks would be helpful. – mt1022 Dec 07 '17 at 02:47
  • @mt1022 I haven't checked the benchmarks, but `tabulate` is faster than `table`, and the OP asked for `dplyr` solutions. I think the OP will compare the speeds, as he already has a big dataset. – akrun Dec 07 '17 at 02:48
  • Also, it seems that the OP's function does not work on the example data. – mt1022 Dec 07 '17 at 02:50
  • @mt1022 It is not clear what `list_of_snps` is; `trst <- lapply(1:length(list_of_snps), function(x) my_choosing_function(x))` gives `Error: object 'list_of_snps' not found`. – akrun Dec 07 '17 at 02:52
  • Yes, I create a list of characters from the unique elements in the column first (I didn't mention it; my code was just to show my overall approach). Thanks a lot akrun and mt1022 for your replies, I will system.time all that and let you know which one is faster! I really like both of your examples, they are very clever! – Vbokito Dec 07 '17 at 04:56

A couple of slight tweaks to @mt1022's solution can produce a marginal speedup: nothing to phone home about, but it might be of use if you find your data grows by another order of magnitude.

    library(data.table)
    library(dplyr)

    d <- data.frame(
        Name = as.character(sample.int(10000, 2.5e6, replace = TRUE)),
        Category = as.character(sample.int(5000, 2.5e6, replace = TRUE)),
        More_info = rep('table', 2.5e6)
    )

    Mode <- function(x) {
        ux <- unique(x)
        fr1 <- tabulate(match(x, ux))
        if (n_distinct(fr1) == 1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
    }

    system.time({
        d %>%
            group_by(Name) %>%
            slice(which(Category == Mode(Category))[1])
    })

    # user   system elapsed
    # 40.459   0.180  40.743

    system.time({
        dt <- as.data.table(d)
        dt.max <- dt[, .N, by = .(Name, Category)]
        dt.max[, r := frank(-N, ties.method = 'random'), by = .(Name)]
        dt.max <- dt.max[r == 1, .(Name, Category)]

        dt[dt.max, on = .(Name, Category), mult = 'first']
    })

    # user  system elapsed
    # 4.196   0.052   4.267

Tweaks include:

  • Using setDT() instead of as.data.table() to avoid making a copy of the data.
  • Using stats::runif() to generate the random tiebreaker directly, which is roughly what data.table does internally for the 'random' ties.method of frank().
  • Using setkey() to sort the table.
  • Subsetting the table by the row indices, .I, at which the row number within each group equals the number of observations, .N, in that group. This returns the last row of each group; see the sketch below.
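
A minimal sketch of that last idiom on a tiny hypothetical table (not the benchmark data):

    library(data.table)
    x <- data.table(Name = c("a", "a", "b"), Count = c(1, 3, 2))
    setkey(x, Name, Count)            # sort so the max Count is last within each Name
    idx <- x[, .I[.N], by = Name]$V1  # global row index of the last row in each group
    x[idx]
    #    Name Count
    # 1:    a     3
    # 2:    b     2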

Results:

    system.time({
        # count occurrences per (Name, Category) pair, keyed for fast grouping
        dt.max <- setDT(d)[, .(Count = .N), keyby = .(Name, Category)]
        # add a random tiebreaker column
        dt.max[, rand := stats::runif(.N)]
        # sort so the highest Count (with random tiebreak) is last within each Name
        setkey(dt.max, Name, Count, rand)
        # keep the last row of each Name group
        dt.max[dt.max[, .I[.N], by = Name]$V1, .(Name, Category, Count)]
    })

    # user  system elapsed
    # 1.722   0.057   1.750
– Matt Summersgill