I have a dataset with a list of genes that I have run 2 machine learning models on, so I have 2 sets of predicted scores. I am looking to identify how many genes fall in a similar score range in both sets.
For example, my data looks like this:
Gene1 Score1 Gene2 Score2
PPL 0.77 COL8A1 0.78
NPHS2 0.77 ARHGEF25 0.77
EHD4 0.75 C1GALT1 0.77
THBS1 0.74 CEP164 0.76
PRKAA1 0.74 MLLT3 0.76
WNT7A 0.73 PPL 0.76
DVL1 0.72 MRVI1 0.75
TUBGCP4 0.71 BMPR1B 0.75
SARM1 0.71 RAB1A 0.75
VPS4A 0.70 CLTC 0.75
In this example, the only gene that appears in both lists is PPL. I'm trying to write code that pulls this out, e.g. code that returns all genes matching between the 2 lists with a score >0.75. I want to do this so I can check genes at multiple score thresholds.
I've looked at code from similarly worded questions, but none of them use a data structure like mine. I've tried using filter() and match() but haven't got them working. Any help would be appreciated.
Input data:
dput(df)
structure(list(
  Gene1 = c("PPL", "NPHS2", "EHD4", "THBS1", "PRKAA1",
            "WNT7A", "DVL1", "TUBGCP4", "SARM1", "VPS4A"),
  Score1 = c(0.78, 0.77, 0.75, 0.74, 0.74, 0.73, 0.72, 0.71, 0.71, 0.70),
  Gene2 = c("COL8A1", "ARHGEF25", "C1GALT1", "CEP164", "MLLT3",
            "PPL", "MRVI1", "BMPR1B", "RAB1A", "CLTC"),
  Score2 = c(0.78, 0.77, 0.77, 0.76, 0.76, 0.76, 0.75, 0.75, 0.75, 0.75)),
  row.names = c(NA, -10L), class = c("data.table", "data.frame"))
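
To make the desired output concrete, here is a minimal base R sketch. It assumes a gene only counts as a match if its score is above the threshold in both columns, which is one reading of the goal. `shared_genes` is a hypothetical helper name, and the data is rebuilt as a plain data.frame from the dput() values.

```r
# Example data, copied from the dput() output (plain data.frame for simplicity)
df <- data.frame(
  Gene1  = c("PPL", "NPHS2", "EHD4", "THBS1", "PRKAA1",
             "WNT7A", "DVL1", "TUBGCP4", "SARM1", "VPS4A"),
  Score1 = c(0.78, 0.77, 0.75, 0.74, 0.74, 0.73, 0.72, 0.71, 0.71, 0.70),
  Gene2  = c("COL8A1", "ARHGEF25", "C1GALT1", "CEP164", "MLLT3",
             "PPL", "MRVI1", "BMPR1B", "RAB1A", "CLTC"),
  Score2 = c(0.78, 0.77, 0.77, 0.76, 0.76, 0.76, 0.75, 0.75, 0.75, 0.75),
  stringsAsFactors = FALSE
)

# Hypothetical helper: genes scoring strictly above `threshold` in BOTH lists.
# Assumption: the threshold applies to Score1 and Score2 alike.
shared_genes <- function(data, threshold) {
  hits1 <- data$Gene1[data$Score1 > threshold]  # genes passing in model 1
  hits2 <- data$Gene2[data$Score2 > threshold]  # genes passing in model 2
  intersect(hits1, hits2)                       # genes present in both sets
}

# Matching genes at a single threshold
shared_genes(df, 0.75)

# Number of matching genes at several thresholds
thresholds <- c(0.70, 0.75, 0.77)
counts <- sapply(thresholds, function(t) length(shared_genes(df, t)))
data.frame(threshold = thresholds, n_shared = counts)
```

Because the two gene columns are independent lists rather than paired rows, the sketch filters each column separately and then intersects the results, instead of comparing Gene1 and Gene2 row by row.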