I know how to do this the long way, but I know there is a shorter and simpler solution in R. I have two data frames: "tpm", which has sample IDs as column names, genes as row names, and TPM values as entries, and "mani", which has sample IDs in the "sample" column and mutations in the "mutation" column. I want to filter "tpm" to keep only the genes that are expressed at >= 5 TPM in at least 30% of the samples of at least one of the 20 mutations.
Input:
tpm df[18,000 x 1500]
Gene Sample A Sample B Sample C Sample D Sample E ...
6kbHsap 5 10 2 0 2
ACRO1 0 0 3 4 5
ALINE 0 0 2 10 1
ALR 7 1 21 1 0
...
mani df[1500 x 2]
sample mutation
1 Sample A X
2 Sample B X
3 Sample C X
4 Sample D Y
5 Sample E X
...
Result:
tpm df[10,000 x 1500]
Gene Sample A Sample B Sample C Sample D Sample E ...
6kbHsap 5 10 2 0 2
ALINE 0 0 2 10 1
ALR 7 1 21 1 0
...
How could I do this in as few lines of code as possible?
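To make the intended filter concrete, here is a minimal base R sketch run on the toy data from the example above. It assumes that "30% of samples" means at least 30% of the samples carrying a given mutation, and that every column name of "tpm" appears in mani$sample. The names grp, frac and tpm_filtered are only illustrative.

```r
# Toy data copied from the example; the real tpm is roughly 18,000 x 1500.
tpm <- data.frame(
  `Sample A` = c(5, 0, 0, 7),
  `Sample B` = c(10, 0, 0, 1),
  `Sample C` = c(2, 3, 2, 21),
  `Sample D` = c(0, 4, 10, 1),
  `Sample E` = c(2, 5, 1, 0),
  row.names = c("6kbHsap", "ACRO1", "ALINE", "ALR"),
  check.names = FALSE  # keep the spaces in the column names
)
mani <- data.frame(
  sample   = paste("Sample", c("A", "B", "C", "D", "E")),
  mutation = c("X", "X", "X", "Y", "X")
)

# Mutation label for each column of tpm, in column order.
grp <- mani$mutation[match(colnames(tpm), mani$sample)]

# Per gene (rows) and per mutation (columns): fraction of samples with TPM >= 5.
frac <- sapply(split(seq_along(grp), grp),
               function(idx) rowMeans(tpm[, idx, drop = FALSE] >= 5))

# Keep genes that reach 30% in at least one mutation group.
tpm_filtered <- tpm[apply(frac >= 0.3, 1, any), , drop = FALSE]
```

On the toy data this keeps 6kbHsap, ALINE and ALR and drops ACRO1, which matches the expected result shown above. All columns are retained; only rows are filtered.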