A match that gives me all the matches, not just the first one

Question

Good day. I posted a question earlier, but it was too cumbersome and I overloaded it with data. Basically, I want to know whether match or some other function can give me all the matches, including repeated values, when comparing two or more lists and, if possible, without removing the NAs from the result. Example: I have these two data frames:

Frame 1                          Frame 2
GenIDModel  GenIDOrganisms       GenIDModel GenIDOrganism1
gen_pep01   hsa_pep01            gen_pep01  hsa_pep01
gen_pep01   hsa_pep02            gen_pep01  hsa_pep02
gen_pep01   hsa_pep03            gen_pep01  hsa_pep03
gen_pep03   hsa_pep11            gen_pep03  hsa_pep11
gen_pep05   hsa_pep20            gen_pep05  hsa_pep20
gen_pep02   rno_pep14           
gen_pep05   rno_pep22           
gen_pep05   rno_pep23           
gen_pep05   rno_pep25           
gen_pep01   dre_pep01           
gen_pep03   dre_pep08           
gen_pep08   dre_pep99           
gen_pep11   dre_pep99           
gen_pep02   rno_pep24           
gen_pep03   rno_pep35           
gen_pep05   rno_pep20           
gen_pep07   rno_pep27

When I use match

MatchFrame1vsFrame2 <- match(Frame1$GenIDModel, Frame2$GenIDModel)

I get this

MatchFrame1vsFrame2
# [1]  1  1  1  4  5 NA  5  5  5  1  4 NA NA NA  4  5 NA

And when I extract the names I get this

NamesMatchFrame1vsFrame2 <- Frame2$GenIDOrganism1[MatchFrame1vsFrame2]
NamesMatchFrame1vsFrame2
# [1] "hsa_pep01" "hsa_pep01" "hsa_pep01" "hsa_pep11" "hsa_pep20"
# [6] NA          "hsa_pep20" "hsa_pep20" "hsa_pep20" "hsa_pep01"
# [11] "hsa_pep11" NA          NA          NA          "hsa_pep11"
# [16] "hsa_pep20" NA

But what I actually want is this

# [1] "hsa_pep01" "hsa_pep02" "hsa_pep03" "hsa_pep11" "hsa_pep20"
# [6] NA          "rno_pep22" "rno_pep23" "rno_pep25" "dre_pep01"
# [11] "dre_pep08" NA          NA          NA          "rno_pep35"
# [16] "rno_pep20" NA

Is there any function or series of functions that allows me to obtain something like this?

Note: I also tried %in%, but when I extract the names it doesn't give me all of them:

inFrame1vsFrame2 <- Frame1$GenIDModel %in% Frame2$GenIDModel
inFrame1vsFrame2
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
# [11]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
NamesinFrame1vsFrame2 <- Frame2$GenIDOrganism1[inFrame1vsFrame2]
NamesinFrame1vsFrame2
# [1] "hsa_pep01" "hsa_pep02" "hsa_pep03" "hsa_pep11" "hsa_pep20"
# [6] NA          NA          NA          NA          NA         
# [11] NA          NA

Thanks a lot for your time, have a great day!

I'm confused on the logic here. How would `rno_pep27` be returned in the expected result? — Skaqqs, Jan 20 '22 at 16:07

score 0 · Answer 1 · answered Jan 20 '22 at 17:28

ifelse(test = df1$GenIDModel %in% df2$GenIDModel, yes = df1$GenIDOrganisms, no = NA)

 [1] "hsa_pep01" "hsa_pep02" "hsa_pep03" "hsa_pep11" "hsa_pep20"
 [6] NA          "rno_pep22" "rno_pep23" "rno_pep25" "dre_pep01"
[11] "dre_pep08" NA          NA          NA          "rno_pep35"
[16] "rno_pep20" NA
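
A minimal sketch of why this behaves differently from the match() attempt in the question, assuming a cut-down copy of the data (the names model1, organisms1 and model2 are made up for illustration). match() returns only the position of the first match in the lookup vector, so every gen_pep01 maps to position 1. %in% only tests membership, so the values from the first frame can be kept as they are:

```r
# Cut-down, hypothetical copies of the columns used above
model1     <- c("gen_pep01", "gen_pep01", "gen_pep02", "gen_pep05")
organisms1 <- c("hsa_pep01", "hsa_pep02", "rno_pep14", "rno_pep22")
model2     <- c("gen_pep01", "gen_pep01", "gen_pep05")

# match(): position of the FIRST match only, NA when there is none
first_pos <- match(model1, model2)
first_pos
# [1]  1  1 NA  3

# %in%: TRUE/FALSE membership test, same length as model1
is_present <- model1 %in% model2
is_present
# [1]  TRUE  TRUE FALSE  TRUE

# Keep the first frame's own value where the ID is present, NA otherwise
result <- ifelse(is_present, organisms1, NA)
result
# [1] "hsa_pep01" "hsa_pep02" NA          "rno_pep22"
```

Indexing the second frame with first_pos would return the same name for both gen_pep01 rows, which is the repetition seen in the question.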

data:

df1 <- structure(list(GenIDModel = c("gen_pep01", "gen_pep01", "gen_pep01", 
"gen_pep03", "gen_pep05", "gen_pep02", "gen_pep05", "gen_pep05", 
"gen_pep05", "gen_pep01", "gen_pep03", "gen_pep08", "gen_pep11", 
"gen_pep02", "gen_pep03", "gen_pep05", "gen_pep07"), GenIDOrganisms = c("hsa_pep01", 
"hsa_pep02", "hsa_pep03", "hsa_pep11", "hsa_pep20", "rno_pep14", 
"rno_pep22", "rno_pep23", "rno_pep25", "dre_pep01", "dre_pep08", 
"dre_pep99", "dre_pep99", "rno_pep24", "rno_pep35", "rno_pep20", 
"rno_pep27")), class = "data.frame", row.names = c(NA, -17L))

df2 <- structure(list(GenIDModel = c("gen_pep01", "gen_pep01", "gen_pep01", 
"gen_pep03", "gen_pep05"), GenIDOrganism1 = c("hsa_pep01", "hsa_pep02", 
"hsa_pep03", "hsa_pep11", "hsa_pep20")), class = "data.frame", row.names = c(NA, 
-5L))

utubun · Answer 2 · 2022-01-20T18:06:09.693

It's not so much about which functions you use as about having a clear idea of your data.

The logic behind your question is to replace with NA every value in df01$GenIDOrganisms whose df01$GenIDModel has no match in df02$GenIDModel.

To achieve this, consider the standard base::replace(x, list, values) function. It replaces the values in x at the indices given by list with the values given by values.
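
As a rough illustration of how replace() treats its arguments, here is a toy sketch on a made-up vector (x, masked and renamed are hypothetical names, not part of the data in this thread):

```r
x <- c("a", "b", "c", "d")

# Logical index: TRUE marks the positions to overwrite
masked <- replace(x, c(FALSE, TRUE, FALSE, TRUE), NA)
masked
# [1] "a" NA  "c" NA

# An integer index works as well
renamed <- replace(x, 2, "B")
renamed
# [1] "a" "B" "c" "d"

# replace() returns a modified copy; x itself is unchanged
x
# [1] "a" "b" "c" "d"
```

Because the result is a copy, it has to be assigned back to the column if the data frame itself should change.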

In your particular case, this could be implemented as follows:

replace(df01$GenIDOrganisms, ! df01$GenIDModel %in% df02$GenIDModel, NA)

Data:

df01 <- structure(list(GenIDModel = c("gen_pep01", "gen_pep01", "gen_pep01", 
"gen_pep03", "gen_pep05", "gen_pep02", "gen_pep05", "gen_pep05", 
"gen_pep05", "gen_pep01", "gen_pep03", "gen_pep08", "gen_pep11", 
"gen_pep02", "gen_pep03", "gen_pep05", "gen_pep07"), GenIDOrganisms = c("hsa_pep01", 
"hsa_pep02", "hsa_pep03", "hsa_pep11", "hsa_pep20", "rno_pep14", 
"rno_pep22", "rno_pep23", "rno_pep25", "dre_pep01", "dre_pep08", 
"dre_pep99", "dre_pep99", "rno_pep24", "rno_pep35", "rno_pep20", 
"rno_pep27")), class = "data.frame", row.names = c(NA, -17L))

df02 <- structure(list(GenIDModel = c("gen_pep01", "gen_pep01", "gen_pep01", 
"gen_pep03", "gen_pep05"), GenIDOrganism1 = c("hsa_pep01", "hsa_pep02", 
"hsa_pep03", "hsa_pep11", "hsa_pep20")), class = "data.frame", row.names = c(NA, 
-5L))

P.S.: Please read how to make a reproducible example.

P.P.S.: Please read this; writing readable code is extremely important, even if you think it is not a priority for you right now.
