My question is an extension of this question.
I have a dataframe of genetic variants tested for association with genes. Each gene can have multiple variants associated with it. There are 77 million rows. I want the best variant for each gene, ie with the lowest p value.
How can I best approach this? Should I first create a list of all unique genes and next get the best variant based on the p value column?
Here is a representative sample of the data:
EnsemblID VariantID tss_distance CAF_eQTL ma_samples ma_count pval_nominal Beta SE chromosome ... Source AverageMaximumPosteriorCall Info AA_N AB_N BB_N TotalN MAF MissingDataProportion HWE_P
0 ENSG00000187634 1:693731 -240519 0.134185 153 168 0.881492 0.005953 0.039917 1 ... HRCr11 0.902046 0.624501 1637.01 451.103 35.569 2124 0.122957 0.000075 0.544841
1 ENSG00000187634 1:714596 -219654 0.038339 48 48 0.834746 -0.015369 0.073635 1 ... HRCr11 0.968278 0.577564 1987.73 135.495 0.677 2124 0.032217 0.000024 0.272504
2 ENSG00000187634 1:715367 -218883 0.038339 48 48 0.834746 -0.015369 0.073635 1 ... HRCr11 0.975207 0.671960 1976.85 146.512 0.556 2124 0.034753 0.000019 0.176551
3 ENSG00000187634 1:717485 -216765 0.038339 48 48 0.834746 -0.015369 0.073635 1 ... HRCr11 0.975183 0.670147 1978.35 145.041 0.517 2124 0.034388 0.000022 0.174817
4 ENSG00000187634 1:720381 -213869 0.038339 48 48 0.834746 -0.015369 0.073635 1 ... HRCr11 0.974476 0.667536 1973.72 149.503 0.691 2124 0.035521 0.000020 0.110452
...
51006816 ENSG00000176383 12:121614338 -591979 0.398562 392 499 0.188786 0.035313 0.026839 12 ... 1000Gp3v5 0.998417 0.997017 782.742 992.043 349.155 2124 0.397929 0.000014 0.257526
8315020 ENSG00000138036 2:42957539 -834486 0.608626 385 490 0.099884 0.023363 0.014176 2 ... HRCr11 0.986869 0.976969 370.833 958.869 794.226 2124 0.400328 0.000017 0.006637
4355839 ENSG00000176444 1:155829511 558831 0.272364 295 341 0.139756 -0.032983 0.022305 1 ... HRCr11 0.992841 0.985842 1081.700 853.253 188.994 2124 0.289847 0.000012 0.291429
53045255 ENSG00000092199 14:22160099 920783 0.278754 302 349 0.600446 -0.008187 0.015622 14 ... HRCr11 0.980246 0.959707 1119.690 845.517 158.721 2124 0.273775 0.000017 0.956476
27329024 ENSG00000146085 6:50332764 885957 0.112620 139 141 0.832515 -0.009139 0.043195 6 ... HRCr11 0.972682 0.887525 1662.300 444.701 16.958 2124 0.112671 0.000011 0.021358
I would expect then something like this:
EnsemblID VariantID tss_distance CAF_eQTL ma_samples ma_count pval_nominal Beta SE chromosome ... Source AverageMaximumPosteriorCall Info AA_N AB_N BB_N TotalN MAF MissingDataProportion HWE_P
1 ENSG00000187634 1:714596 -219654 0.038339 48 48 0.834746 -0.015369 0.073635 1 ... HRCr11 0.968278 0.577564 1987.73 135.495 0.677 2124 0.032217 0.000024 0.272504
...
51006816 ENSG00000176383 12:121614338 -591979 0.398562 392 499 0.188786 0.035313 0.026839 12 ... 1000Gp3v5 0.998417 0.997017 782.742 992.043 349.155 2124 0.397929 0.000014 0.257526
8315020 ENSG00000138036 2:42957539 -834486 0.608626 385 490 0.099884 0.023363 0.014176 2 ... HRCr11 0.986869 0.976969 370.833 958.869 794.226 2124 0.400328 0.000017 0.006637
4355839 ENSG00000176444 1:155829511 558831 0.272364 295 341 0.139756 -0.032983 0.022305 1 ... HRCr11 0.992841 0.985842 1081.700 853.253 188.994 2124 0.289847 0.000012 0.291429
53045255 ENSG00000092199 14:22160099 920783 0.278754 302 349 0.600446 -0.008187 0.015622 14 ... HRCr11 0.980246 0.959707 1119.690 845.517 158.721 2124 0.273775 0.000017 0.956476
27329024 ENSG00000146085 6:50332764 885957 0.112620 139 141 0.832515 -0.009139 0.043195 6 ... HRCr11 0.972682 0.887525 1662.300 444.701 16.958 2124 0.112671 0.000011 0.021358