I know how to do this the long way, but I know there is a shorter and simpler solution in R. I have two data frames: "tpm", which has sample IDs as column names, genes as row names, and TPMs as values, and "mani", which has sample IDs in the "sample" column and mutations in the "mutation" column. I want to filter the "tpm" data frame for genes that are expressed at >= 5 TPMs in 30% of samples in at least 1 of the 20 mutations.

Input:

tpm df[18,000 x 1500]

Gene    Sample A        Sample B        Sample C        Sample D        Sample E ... 

6kbHsap 5               10              2               0               2
ACRO1   0               0               3               4               5
ALINE   0               0               2              10               1
ALR     7               1               21              1               0
...

mani df[1500 x 2]

    sample         mutation

1   Sample A       X
2   Sample B       X
3   Sample C       X
4   Sample D       Y
5   Sample E       X
...

Result:

tpm df[10,000 x 1500]

Gene    Sample A        Sample B        Sample C        Sample D        Sample E ... 

6kbHsap 5               10              2               0               2
ALINE   0               0               2              10               1
ALR     7               1               21              1               0
...

How could I do this in as few lines of code as possible?

Jack Pep
    Not sure what you mean by "30% of samples in at least 1 of the 20 mutations". Would be great if you can provide the code you have tried. Please also provide reproducible sample data as mentioned at https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example. – William Wong Aug 09 '23 at 19:51
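
One possible reading, which is an assumption but is consistent with the example input and result above, is: keep a gene if, for at least one mutation, at least 30% of the samples carrying that mutation have TPM >= 5. Under that reading, a minimal base R sketch could look like the following. The toy data frames are copied from the example tables; the names grp, frac, and tpm_filtered are made up for illustration.

```r
# Toy data copied from the example tables in the question.
tpm <- data.frame(
  `Sample A` = c(5, 0, 0, 7),
  `Sample B` = c(10, 0, 0, 1),
  `Sample C` = c(2, 3, 2, 21),
  `Sample D` = c(0, 4, 10, 1),
  `Sample E` = c(2, 5, 1, 0),
  row.names = c("6kbHsap", "ACRO1", "ALINE", "ALR"),
  check.names = FALSE  # keep the spaces in the sample IDs
)
mani <- data.frame(
  sample = paste("Sample", LETTERS[1:5]),
  mutation = c("X", "X", "X", "Y", "X")
)

# Assumption: every column of tpm appears once in mani$sample.
# Look up the mutation label for each column of tpm, in column order.
grp <- mani$mutation[match(colnames(tpm), mani$sample)]

# Genes x mutations matrix: fraction of each mutation's samples with TPM >= 5.
frac <- sapply(split(seq_along(grp), grp), function(idx) {
  rowMeans(tpm[, idx, drop = FALSE] >= 5)
})

# Keep genes that reach the 30% threshold in at least one mutation group.
tpm_filtered <- tpm[rowSums(frac >= 0.3) > 0, , drop = FALSE]
print(rownames(tpm_filtered))
```

On the toy data this keeps 6kbHsap, ALINE, and ALR and drops ACRO1, which matches the Result table. On the real data only the last three statements would be needed. If "30% of samples" was meant as 30% of all samples rather than 30% within a mutation group, the grouping step would have to change.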
