I run into a bit of trouble with my code and would welcome any suggestion to make it run faster. I have a data frame that looks like that:
Name <- c("a","a","a","a","a","b","b","b","b","c")
Category <- c("sun","cat","sun","sun","sea","sun","sea","cat","dog","cat")
More_info <- c("table","table","table","table","table","table","table","table","table","cat")
d <- data.frame(Name,Category,More_info)
So I have duplicated entries for each row in column Name (the number of duplicates can vary). For each entry (a,b,...) I want to count the sum of each corresponding element in the Category column and keep the only category that appears the most. If an entry has an equal number of categories, I want to take one of most categories randomly. So in this case, the output dataframe would look like this:
Name <- c("a","b","c")
Category <- c("sun","dog","cat")
More_info <- c("table","table","table")
d <- data.frame(Name,Category,More_info)
a have sun entry kept because it appears the most, b would be dog or whatever other value as they all appear once with b, and c wouldn't be changed. My function looks like this:
my_choosing_function <- function(x){
tmp = dbSNP_hapmap[dbSNP_hapmap$refsnp_id==list_of_snps[x],]
snp_freq <- as.data.frame(table(tmp$consequence_type_tv))
best_hit <- snp_freq[order(-snp_freq$Freq),]
best_hit$SNP<-list_of_snps[x]
top<-best_hit[1,]
return(top)
}
trst <- lapply(1:length(list_of_snps), function(x) my_choosing_function(x))
final <- do.call("rbind",trst)
Where I start from a list of unique elements (that would be Name in our case), for each element I do a table of the duplicated entries, I order the table by descending values and keep the top element. I do a lapply for each element of the list of unique values, then do a rbind of the whole thing.
As I have 2500000 rows in my initial data frame and 1500000 unique elements, it takes forever to run. 4 seconds for 100 lines, that would be a total of 34 hours for the lapply.
I'm sure packages like dplyr can do it in a few minutes but can't find a solution to do it. Anyone has an idea? Thanks a lot for your help!