
I have two data frames: SCR and matchedSCR. Each contains a list of protein headings. matchedSCR is a subset of SCR, created directly from SCR. The strings for the matchedSCR protein headings should thus be identical to their counterparts in SCR and be able to serve as an index that links them. However, when I try to match the records up, only a small portion of them match, no matter what method I use. Each of the following matches only about 6,000 of what should be 17,000 records.

subset(SCR, (SCR$MESH_HEADING %in% matchedSCR$Heading)) 
SCR[SCR$MESH_HEADING %in% matchedSCR$Heading, ] 
sqldf("select * from SCR join matchedSCR on SCR.MESH_HEADING=matchedSCR.Heading")

What is maddening is that I can find a missing line and match it by hand!

if(SCR$MESH_HEADING[64] == matchedSCR$Heading[2]) {print("T")} 
[1] "T"

Matching SCR to a different subset dataframe, orthologSCR, created in almost precisely the same way from SCR, works perfectly, so I assume the problem is somehow with matchedSCR, but I cannot figure out why. It's just a single column of characters (not factors) like:

VisA protein, Streptomyces virginiae
VisB protein, Streptomyces virginiae
VisC protein, Streptomyces virginiae
VisD protein, Streptomyces virginiae
subpeptin JM-A, Bacillus subtilis
subpeptin JM-B, Bacillus subtilis
BT peptide antibiotic, Brevibacillus texasporus
LI-Fb peptide, Paenibacillus polymyxa

Can anyone suggest reasons these character comparisons might be failing? Would special characters trip things up for any reason in here? (They don't seem to matter when matching to the other subset data frame that is working.) What I really need is the unmatched data from SCR. I can generate this right now with an incredibly slow process based on the opposite of the complex selection that created matchedSCR, but I would really like to learn from the error I'm getting here so I don't encounter this again.
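One thing worth ruling out before suspecting the strings themselves is the counting. `%in%` flags each row of SCR at most once, so if matchedSCR contains repeated headings, the number of rows selected from SCR equals the number of distinct headings, not the number of rows in matchedSCR. Whether that applies to this data is an assumption; the sketch below uses small made-up vectors in place of `SCR$MESH_HEADING` and `matchedSCR$Heading` to show the checks.

```r
# Toy vectors (hypothetical) standing in for SCR$MESH_HEADING and matchedSCR$Heading
scr_heading <- c("VisA protein, Streptomyces virginiae",
                 "VisB protein, Streptomyces virginiae",
                 "VisC protein, Streptomyces virginiae")
matched_heading <- c("VisA protein, Streptomyces virginiae",
                     "VisA protein, Streptomyces virginiae",  # repeated key
                     "VisB protein, Streptomyces virginiae")

# Rows in the subset versus distinct headings in the subset
n_rows   <- length(matched_heading)
n_unique <- length(unique(matched_heading))

# Rows of the parent that %in% can select: bounded by n_unique, not n_rows
n_hits <- sum(scr_heading %in% matched_heading)

# Headings in the subset with no exact counterpart in the parent;
# an empty result means every string really does match
orphans <- setdiff(matched_heading, scr_heading)
```

If `n_unique` is close to the observed match count and `orphans` is empty, the strings are fine and the gap comes from duplicate headings.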

rawr
NotMyJob

1 Answer


You might have whitespace characters before or after the strings. You can try trimming them; see: How to trim leading and trailing whitespace in R?

Also, you could try converting everything to lower case. Base R's tolower() does this on a character vector; the tm library is only needed if you are working with a text corpus.
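The two suggestions above can be sketched together in base R. The data and the helper name `normalize_heading` are made up for illustration; only `trimws()` and `tolower()` are real base R functions.

```r
# Hypothetical helper: strip edge whitespace, then fold case
normalize_heading <- function(x) tolower(trimws(x))

# Toy vectors (hypothetical) standing in for SCR$MESH_HEADING and matchedSCR$Heading
scr_heading     <- c("VisA protein, Streptomyces virginiae",
                     "subpeptin JM-A, Bacillus subtilis")
matched_heading <- c(" VisA protein, Streptomyces virginiae",  # leading space
                     "Subpeptin JM-A, Bacillus subtilis  ")    # case + trailing spaces

# Exact comparison fails on both rows
hits_before <- sum(scr_heading %in% matched_heading)

# Comparison on normalized keys succeeds
hits_after <- sum(normalize_heading(scr_heading) %in%
                  normalize_heading(matched_heading))
```

Note that `trimws()` by default strips only spaces, tabs, carriage returns, and newlines, so other invisible characters (a non-breaking space, for instance) would survive it and need separate handling.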

Probably the best thing to do in your case just to see what's going on is:

library(dplyr)
SCR$Heading <- SCR$MESH_HEADING  # give both data frames the same key column name
full_join(SCR, matchedSCR, by = "Heading") %>% View()

Investigate that df and see which matches were made and which weren't... that'll help you understand the problem. You can also try anti_join to only see the unmatched records.
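As a minimal sketch of the `anti_join` route, assuming dplyr is installed and using made-up data in place of the real SCR and matchedSCR:

```r
library(dplyr)

# Toy data frames (hypothetical) with the same column names as in the question
SCR <- data.frame(
  MESH_HEADING = c("VisA protein, Streptomyces virginiae",
                   "VisB protein, Streptomyces virginiae",
                   "BT peptide antibiotic, Brevibacillus texasporus"),
  stringsAsFactors = FALSE
)
matchedSCR <- data.frame(
  Heading = c("VisA protein, Streptomyces virginiae"),
  stringsAsFactors = FALSE
)

# Rows of SCR whose heading never appears in matchedSCR;
# the named vector maps the differently named key columns
unmatched <- anti_join(SCR, matchedSCR, by = c("MESH_HEADING" = "Heading"))

# Base R equivalent, useful as a cross-check
unmatched_base <- SCR[!(SCR$MESH_HEADING %in% matchedSCR$Heading), , drop = FALSE]
```

Mapping the key columns in `by` avoids having to copy `MESH_HEADING` into a new `Heading` column first. If the two results disagree on real data, that in itself points at something unusual in the key column.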

Worst case, check out https://cran.r-project.org/web/packages/fuzzyjoin/fuzzyjoin.pdf

Amit Kohli
  • Thank you! I will try your suggestions. – NotMyJob Oct 20 '16 at 18:08
  • @NotMyJob, it's customary in this community to attempt the solution given, and if it works, mark as "answered", and if you want to give me thanks, you upvote my answer (commenting to say thanks is frowned upon). If it doesn't work, you would write a comment specifying why not. Remember, this should be useful for others after you! – Amit Kohli Oct 20 '16 at 18:45