I have a dataset with a list of genes that I have run 2 machine learning models on, so I have 2 sets of predicted scores. I want to identify how many genes fall in a similar score range in both sets.

For example my data looks like this:

Gene1       Score1       Gene2      Score2
PPL         0.78         COL8A1     0.78
NPHS2       0.77         ARHGEF25   0.77
EHD4        0.75         C1GALT1    0.77
THBS1       0.74         CEP164     0.76
PRKAA1      0.74         MLLT3      0.76
WNT7A       0.73         PPL        0.76
DVL1        0.72         MRVI1      0.75
TUBGCP4     0.71         BMPR1B     0.75
SARM1       0.71         RAB1A      0.75
VPS4A       0.70         CLTC       0.75

Here the only gene that appears in both lists is PPL. I'm trying to write code that pulls this out, e.g. code that returns all genes matching between the 2 lists with a score > 0.75. I want to run this check at multiple score thresholds.

I've looked at code from similarly worded questions, but none of them use a data structure like mine. I've tried filter() and match() but haven't got them working. Any help would be appreciated.

Input data:

dput(df)
structure(list(Gene1 = c("PPL", "NPHS2", "EHD4", "THBS1", "PRKAA1", 
"WNT7A", "DVL1", "TUBGCP4", "SARM1", "VPS4A"), `Score1` = c(0.78, 
0.77, 0.75, 0.74, 0.74, 0.73, 
0.72, 0.71, 0.71, 0.70), Gene2 = c("COL8A1", 
"ARHGEF25", "C1GALT1", "CEP164", "MLLT3", "PPL", "MRVI1", "BMPR1B", 
"RAB1A", "CLTC"), `Score2` = c(0.78, 0.77, 
0.77, 0.76, 0.76, 0.76, 0.75, 
0.75, 0.75, 0.75)), row.names = c(NA, -10L
), class = c("data.table", "data.frame"))
DN1

2 Answers

You can self-join the data frame, matching Gene1 against Gene2, to get all the genes common to both lists.

library(dplyr)

inner_join(df, df, by = c('Gene1' = 'Gene2')) %>%
  select(Gene1, Score1 = Score1.x,  Score2 = Score2.y)

#   Gene1 Score1 Score2
#1:   PPL   0.78   0.76

You can then filter Score1 and Score2 based on some threshold.
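
The filtering step could look something like the sketch below. It is not the original answer's code: it splits the two gene/score pairs into separate tables before joining (so it does not depend on the `.x`/`.y` suffixes), and the names `threshold`, `common` and the shared `Gene` column are illustrative assumptions. The data is rebuilt from the `dput` output in the question.

```r
library(dplyr)

# Data from the question's dput output, rebuilt as a plain data frame.
df <- data.frame(
  Gene1  = c("PPL", "NPHS2", "EHD4", "THBS1", "PRKAA1",
             "WNT7A", "DVL1", "TUBGCP4", "SARM1", "VPS4A"),
  Score1 = c(0.78, 0.77, 0.75, 0.74, 0.74, 0.73, 0.72, 0.71, 0.71, 0.70),
  Gene2  = c("COL8A1", "ARHGEF25", "C1GALT1", "CEP164", "MLLT3",
             "PPL", "MRVI1", "BMPR1B", "RAB1A", "CLTC"),
  Score2 = c(0.78, 0.77, 0.77, 0.76, 0.76, 0.76, 0.75, 0.75, 0.75, 0.75)
)

threshold <- 0.75  # illustrative cut-off, change as needed

# Give both gene columns the same name, join on it, then keep only
# genes whose score exceeds the threshold in both models.
common <- inner_join(
  select(df, Gene = Gene1, Score1),
  select(df, Gene = Gene2, Score2),
  by = "Gene"
) %>%
  filter(Score1 > threshold, Score2 > threshold)

common
```

Changing `threshold` and re-running the last pipeline is enough to check a different cut-off.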

Ronak Shah

Staying in data.table:

library(data.table)
df1 <- df[,.(Gene1,Score1)]
df2 <- df[,.(Gene2,Score2)]

threshold <- 0.75
df1[df2, on = .(Gene1 = Gene2)][Score1 > threshold & Score2 > threshold]

   Gene1 Score1 Score2
1:   PPL   0.78   0.76
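
Since the question asks about several thresholds, one possible extension is to do the join once and then count the matching genes per cut-off. This is only a sketch: the `thresholds` values and the names `both`, `counts` and `n_common` are illustrative assumptions, and the data is rebuilt from the question's `dput` output.

```r
library(data.table)

# Data from the question's dput output.
df <- data.table(
  Gene1  = c("PPL", "NPHS2", "EHD4", "THBS1", "PRKAA1",
             "WNT7A", "DVL1", "TUBGCP4", "SARM1", "VPS4A"),
  Score1 = c(0.78, 0.77, 0.75, 0.74, 0.74, 0.73, 0.72, 0.71, 0.71, 0.70),
  Gene2  = c("COL8A1", "ARHGEF25", "C1GALT1", "CEP164", "MLLT3",
             "PPL", "MRVI1", "BMPR1B", "RAB1A", "CLTC"),
  Score2 = c(0.78, 0.77, 0.77, 0.76, 0.76, 0.76, 0.75, 0.75, 0.75, 0.75)
)

# Join once; nomatch = NULL keeps only genes present in both lists.
both <- df[, .(Gene = Gene1, Score1)][
  df[, .(Gene = Gene2, Score2)], on = "Gene", nomatch = NULL
]

thresholds <- c(0.70, 0.75, 0.77)  # illustrative cut-offs

# For each cut-off, count genes scoring above it in both models.
counts <- data.table(
  threshold = thresholds,
  n_common  = sapply(thresholds, function(th) both[Score1 > th & Score2 > th, .N])
)

counts
```

To list the genes rather than count them, the same filter can be applied to `both` directly for a given cut-off.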
Waldi