How to count matching groups by a threshold in R

Question

I have a dataset with a list of genes on which I have run 2 machine learning models, so I have 2 sets of predicted scores. I am looking to identify how many genes fall in a similar score range in both sets.

For example my data looks like this:

Gene1       Score1       Gene2      Score2
PPL         0.78         COL8A1     0.78
NPHS2       0.77         ARHGEF25   0.77
EHD4        0.75         C1GALT1    0.77
THBS1       0.74         CEP164     0.76
PRKAA1      0.74         MLLT3      0.76
WNT7A       0.73         PPL        0.76
DVL1        0.72         MRVI1      0.75
TUBGCP4     0.71         BMPR1B     0.75
SARM1       0.71         RAB1A      0.75
VPS4A       0.70         CLTC       0.75

Here the only gene that appears in both lists is PPL. I'm trying to write code that pulls this out, e.g. code that returns all genes present in both lists with a score > 0.75 in each. I want to run this check at multiple score thresholds.

I've looked at code from similarly worded questions, but none of them use a data structure like mine. I've tried filter() and match() but haven't got them working. Any help would be appreciated.

Input data:

dput(df)
structure(list(Gene1 = c("PPL", "NPHS2", "EHD4", "THBS1", "PRKAA1", 
"WNT7A", "DVL1", "TUBGCP4", "SARM1", "VPS4A"), `Score1` = c(0.78, 
0.77, 0.75, 0.74, 0.74, 0.73, 
0.72, 0.71, 0.71, 0.70), Gene2 = c("COL8A1", 
"ARHGEF25", "C1GALT1", "CEP164", "MLLT3", "PPL", "MRVI1", "BMPR1B", 
"RAB1A", "CLTC"), `Score2` = c(0.78, 0.77, 
0.77, 0.76, 0.76, 0.76, 0.75, 
0.75, 0.75, 0.75)), row.names = c(NA, -10L
), class = c("data.table", "data.frame"))

score 2 · Accepted Answer · answered Sep 28 '20 at 09:44

You can join the data frame to itself to get all the genes common to both lists.

library(dplyr)

inner_join(df, df, by = c('Gene1' = 'Gene2')) %>%
  select(Gene1, Score1 = Score1.x,  Score2 = Score2.y)

#   Gene1 Score1 Score2
#1:   PPL   0.78   0.76

You can then filter Score1 and Score2 based on some threshold.
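
As a minimal sketch of that filtering step, the join above can be extended with filter() and the result counted with nrow(). The data is rebuilt here from the question's dput() output so the snippet runs on its own; the names `threshold` and `matched` are illustrative, and the 0.75 cutoff is simply the one used in the question.

```r
library(dplyr)

# Example data copied from the question's dput() output
df <- data.frame(
  Gene1  = c("PPL", "NPHS2", "EHD4", "THBS1", "PRKAA1",
             "WNT7A", "DVL1", "TUBGCP4", "SARM1", "VPS4A"),
  Score1 = c(0.78, 0.77, 0.75, 0.74, 0.74, 0.73, 0.72, 0.71, 0.71, 0.70),
  Gene2  = c("COL8A1", "ARHGEF25", "C1GALT1", "CEP164", "MLLT3",
             "PPL", "MRVI1", "BMPR1B", "RAB1A", "CLTC"),
  Score2 = c(0.78, 0.77, 0.77, 0.76, 0.76, 0.76, 0.75, 0.75, 0.75, 0.75)
)

threshold <- 0.75  # illustrative cutoff, taken from the question's example

matched <- inner_join(df, df, by = c("Gene1" = "Gene2")) %>%
  select(Gene1, Score1 = Score1.x, Score2 = Score2.y) %>%
  # keep genes whose score exceeds the cutoff in both models
  filter(Score1 > threshold, Score2 > threshold)

matched        # the matching genes above the threshold
nrow(matched)  # how many there are
```

Changing `threshold` and re-running the pipeline gives the count at a different cutoff.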

score 2 · Answer 2 · answered Sep 28 '20 at 09:46

Staying in data.table:

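library(data.table)

# Split the wide table into one table per model
df1 <- df[, .(Gene1, Score1)]
df2 <- df[, .(Gene2, Score2)]

threshold <- 0.75
# Join on gene name, then keep genes whose score exceeds the threshold in both models
df1[df2, on = .(Gene1 = Gene2)][Score1 > threshold & Score2 > threshold]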

   Gene1 Score1 Score2
1:   PPL   0.78   0.76
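
Since the question asks for counts at several thresholds, one possible extension is to do the join once and then count the surviving rows for each cutoff. This is only a sketch: the `thresholds` vector and the names `common` and `counts` are hypothetical, and the data is rebuilt from the question's dput() output so the snippet is self-contained.

```r
library(data.table)

# Example data copied from the question's dput() output
df <- data.table(
  Gene1  = c("PPL", "NPHS2", "EHD4", "THBS1", "PRKAA1",
             "WNT7A", "DVL1", "TUBGCP4", "SARM1", "VPS4A"),
  Score1 = c(0.78, 0.77, 0.75, 0.74, 0.74, 0.73, 0.72, 0.71, 0.71, 0.70),
  Gene2  = c("COL8A1", "ARHGEF25", "C1GALT1", "CEP164", "MLLT3",
             "PPL", "MRVI1", "BMPR1B", "RAB1A", "CLTC"),
  Score2 = c(0.78, 0.77, 0.77, 0.76, 0.76, 0.76, 0.75, 0.75, 0.75, 0.75)
)

df1 <- df[, .(Gene1, Score1)]
df2 <- df[, .(Gene2, Score2)]

# Join once; nomatch = NULL keeps only genes present in both lists
common <- df1[df2, on = .(Gene1 = Gene2), nomatch = NULL]

# Hypothetical set of cutoffs to check
thresholds <- c(0.70, 0.75, 0.77)

# For each cutoff, count the genes scoring above it in both models
counts <- data.table(
  threshold = thresholds,
  n_genes   = vapply(
    thresholds,
    function(cutoff) common[Score1 > cutoff & Score2 > cutoff, .N],
    integer(1)
  )
)
counts
```

Joining first and filtering afterwards means the join runs once, however many thresholds are checked.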

Waldi
