0

My question is an extension of this question.

I have a dataframe of genetic variants tested for association with genes. Each gene can have multiple variants associated with it. There are 77 million rows. I want the best variant for each gene, ie with the lowest p value.

How can I best approach this? Should I first create a list of all unique genes and next get the best variant based on the p value column?

Here is a representative sample of the data:

    EnsemblID   VariantID   tss_distance    CAF_eQTL    ma_samples  ma_count    pval_nominal    Beta    SE  chromosome  ... Source  AverageMaximumPosteriorCall Info    AA_N    AB_N    BB_N    TotalN  MAF MissingDataProportion   HWE_P
0   ENSG00000187634 1:693731    -240519 0.134185    153 168 0.881492    0.005953    0.039917    1   ... HRCr11  0.902046    0.624501    1637.01 451.103 35.569  2124    0.122957    0.000075    0.544841
1   ENSG00000187634 1:714596    -219654 0.038339    48  48  0.834746    -0.015369   0.073635    1   ... HRCr11  0.968278    0.577564    1987.73 135.495 0.677   2124    0.032217    0.000024    0.272504
2   ENSG00000187634 1:715367    -218883 0.038339    48  48  0.834746    -0.015369   0.073635    1   ... HRCr11  0.975207    0.671960    1976.85 146.512 0.556   2124    0.034753    0.000019    0.176551
3   ENSG00000187634 1:717485    -216765 0.038339    48  48  0.834746    -0.015369   0.073635    1   ... HRCr11  0.975183    0.670147    1978.35 145.041 0.517   2124    0.034388    0.000022    0.174817
4   ENSG00000187634 1:720381    -213869 0.038339    48  48  0.834746    -0.015369   0.073635    1   ... HRCr11  0.974476    0.667536    1973.72 149.503 0.691   2124    0.035521    0.000020    0.110452

...

51006816    ENSG00000176383 12:121614338    -591979 0.398562    392 499 0.188786    0.035313    0.026839    12  ... 1000Gp3v5   0.998417    0.997017    782.742 992.043 349.155 2124    0.397929    0.000014    0.257526
8315020 ENSG00000138036 2:42957539  -834486 0.608626    385 490 0.099884    0.023363    0.014176    2   ... HRCr11  0.986869    0.976969    370.833 958.869 794.226 2124    0.400328    0.000017    0.006637
4355839 ENSG00000176444 1:155829511 558831  0.272364    295 341 0.139756    -0.032983   0.022305    1   ... HRCr11  0.992841    0.985842    1081.700    853.253 188.994 2124    0.289847    0.000012    0.291429
53045255    ENSG00000092199 14:22160099 920783  0.278754    302 349 0.600446    -0.008187   0.015622    14  ... HRCr11  0.980246    0.959707    1119.690    845.517 158.721 2124    0.273775    0.000017    0.956476
27329024    ENSG00000146085 6:50332764  885957  0.112620    139 141 0.832515    -0.009139   0.043195    6   ... HRCr11  0.972682    0.887525    1662.300    444.701 16.958  2124    0.112671    0.000011    0.021358

I would expect then something like this:

    EnsemblID   VariantID   tss_distance    CAF_eQTL    ma_samples  ma_count    pval_nominal    Beta    SE  chromosome  ... Source  AverageMaximumPosteriorCall Info    AA_N    AB_N    BB_N    TotalN  MAF MissingDataProportion   HWE_P
1   ENSG00000187634 1:714596    -219654 0.038339    48  48  0.834746    -0.015369   0.073635    1   ... HRCr11  0.968278    0.577564    1987.73 135.495 0.677   2124    0.032217    0.000024    0.272504

...

51006816    ENSG00000176383 12:121614338    -591979 0.398562    392 499 0.188786    0.035313    0.026839    12  ... 1000Gp3v5   0.998417    0.997017    782.742 992.043 349.155 2124    0.397929    0.000014    0.257526
8315020 ENSG00000138036 2:42957539  -834486 0.608626    385 490 0.099884    0.023363    0.014176    2   ... HRCr11  0.986869    0.976969    370.833 958.869 794.226 2124    0.400328    0.000017    0.006637
4355839 ENSG00000176444 1:155829511 558831  0.272364    295 341 0.139756    -0.032983   0.022305    1   ... HRCr11  0.992841    0.985842    1081.700    853.253 188.994 2124    0.289847    0.000012    0.291429
53045255    ENSG00000092199 14:22160099 920783  0.278754    302 349 0.600446    -0.008187   0.015622    14  ... HRCr11  0.980246    0.959707    1119.690    845.517 158.721 2124    0.273775    0.000017    0.956476
27329024    ENSG00000146085 6:50332764  885957  0.112620    139 141 0.832515    -0.009139   0.043195    6   ... HRCr11  0.972682    0.887525    1662.300    444.701 16.958  2124    0.112671    0.000011    0.021358

  • 2
    I think it would definitely be helpful if you included some example DF with demonstrative rows. It is kind of hard to imagine your problem without seeing the data and your desired output. – Bijay Regmi Jul 05 '23 at 08:06
  • Your example data is hard to read due to alignment - perhaps you can supply an actual runnable example. For polars, it sounds like: `df.filter(pl.col('pval_nominal') == pl.min('pval_nominal').over('EnsemblID')).unique(['EnsemblID', 'pval_nominal'])` – jqurious Jul 05 '23 at 10:34

1 Answers1

1

Here is a suggestion using Polars. I assume that your data is saved in the file data.parquet and EnsemblID is the gene id column.

import polars as pl

(
    pl.scan_parquet('data.parquet')
    .with_columns(pl.col('pval_nominal').min().over('EnsemblID').alias('min_pval'))
    .filter(pl.col('pval_nominal') == pl.col('min_pval'))
    .collect()
)

In the with_columns we compute the smallest p value for each gene and afterwards keep the row(s) where the p value is equal to the smallest.

robertdj
  • 909
  • 1
  • 5
  • 10