I am trying to filter my data.frame down to rows with unique values in two of my columns (Gene and UMI). I am using the distinct() function from the dplyr package to do this. When the data is small the code runs quickly, but when the data.frame is very large (100 million rows or so) it takes what seems like forever to run. Is there a more efficient way to solve this problem?
Here is what I am currently doing (this is just a snippet from a larger program):
library(dplyr)

df <- read.delim("hash_test.txt")
df <- arrange(df, Gene)                    # sort by Gene
filter_umis <- df %>% distinct(Gene, UMI)  # one row per unique Gene/UMI pair
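Note that distinct(Gene, UMI) as written returns only those two columns. Since I also want the expression columns, I believe the .keep_all option keeps the first full row of each Gene/UMI pair:

filter_umis <- df %>% distinct(Gene, UMI, .keep_all = TRUE)  # keep all columns, first row per pair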
Here is some sample data I used to test. The actual data is much larger:
LN.Tfr.1  LN.Tfr.2  LN.Tfr.3     Gene     UMI
27.129    25.324    19.49333333  Tubgcp6  GCCC
8.887     8.886     5.924333333  Tubgcp6  GCCC
4.21      14.661    9.017        Uba52    GTTT
40.693    12.884    22.59466667  Ube2d2   GCAC
1.871     2.221     1.364        Ube2d3   GCAG
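In case it helps frame answers: a base R version of what I think is the same operation, plus a data.table sketch I have been considering (assuming Gene and UMI are the only key columns):

# base R: drop rows whose (Gene, UMI) pair has already been seen
filter_umis <- df[!duplicated(df[c("Gene", "UMI")]), ]

# data.table sketch (untested on the full 100M-row data)
library(data.table)
dt <- fread("hash_test.txt")                      # fast reader for large files
filter_umis <- unique(dt, by = c("Gene", "UMI"))  # first row per Gene/UMI pair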