I have two data frames: SCR and matchedSCR. Each contains a list of protein headings. matchedSCR is a subset of SCR, created directly from SCR. The strings for the matchedSCR protein headings should thus be identical to their counterparts in SCR and be able to serve as an index that links them. However, when I try to match the records up, only a small portion of them match, no matter what method I use. The following all match about 6000 of what should be 17000 records.
subset(SCR, (SCR$MESH_HEADING %in% matchedSCR$Heading))
SCR[SCR$MESH_HEADING %in% matchedSCR$Heading, ]
sqldf("select * from SCR join matchedSCR on SCR.MESH_HEADING=matchedSCR.Heading")
What is maddening is that I can find a missing line and match it by hand!
if(SCR$MESH_HEADING[64] == matchedSCR$Heading[2]) {print("T")}
[1] "T"
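A hand check like this only shows that one pair of strings agrees. It does not rule out that matchedSCR repeats some headings, in which case its 17000 rows would correspond to fewer distinct values and fewer matching rows in SCR. A minimal sketch of how the counts could be compared, using small made-up data frames in place of the real ones (the toy values and the names n_rows, n_distinct, n_hits, and repeated are illustrative only):

```r
# Hypothetical toy data: matchedSCR repeats one heading.
SCR <- data.frame(
  MESH_HEADING = c("VisA protein, Streptomyces virginiae",
                   "VisB protein, Streptomyces virginiae",
                   "VisC protein, Streptomyces virginiae"),
  stringsAsFactors = FALSE
)
matchedSCR <- data.frame(
  Heading = c("VisA protein, Streptomyces virginiae",
              "VisA protein, Streptomyces virginiae",
              "VisB protein, Streptomyces virginiae"),
  stringsAsFactors = FALSE
)

n_rows     <- nrow(matchedSCR)                                # rows in the subset
n_distinct <- length(unique(matchedSCR$Heading))              # distinct headings
n_hits     <- sum(SCR$MESH_HEADING %in% matchedSCR$Heading)   # SCR rows matched

# Headings that occur more than once in matchedSCR.
counts   <- table(matchedSCR$Heading)
repeated <- names(counts)[counts > 1]
```

If n_distinct is close to the number of matched rows rather than to n_rows, duplicates in matchedSCR would account for the gap. If n_distinct is still near 17000, the strings themselves must differ somewhere.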
Matching SCR to a different subset dataframe, orthologSCR, created in almost precisely the same way from SCR, works perfectly, so I assume the problem is somehow with matchedSCR, but I cannot figure out why. It's just a single column of characters (not factors) like:
VisA protein, Streptomyces virginiae
VisB protein, Streptomyces virginiae
VisC protein, Streptomyces virginiae
VisD protein, Streptomyces virginiae
subpeptin JM-A, Bacillus subtilis
subpeptin JM-B, Bacillus subtilis
BT peptide antibiotic, Brevibacillus texasporus
LI-Fb peptide, Paenibacillus polymyxa
Can anyone suggest reasons these character comparisons might be failing? Could special characters be tripping things up? (They don't seem to matter when matching against the other subset data frame, which works.) What I really need is the unmatched data from SCR. I can generate this right now with an incredibly slow process based on the opposite of the complex selection that created matchedSCR, but I would really like to learn from the error I'm getting here so I don't encounter this again.
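For reference, here is a minimal sketch of the kind of check that might expose an invisible difference, assuming (and this is only a guess) that some headings in matchedSCR picked up stray leading or trailing whitespace. The data frames are small made-up stand-ins, and the names exact, trimmed, orphans, and unmatched are illustrative only:

```r
# Hypothetical toy stand-ins for the real data frames.
SCR <- data.frame(
  MESH_HEADING = c("VisA protein, Streptomyces virginiae",
                   "VisB protein, Streptomyces virginiae",
                   "subpeptin JM-A, Bacillus subtilis"),
  stringsAsFactors = FALSE
)
matchedSCR <- data.frame(
  Heading = c("VisA protein, Streptomyces virginiae",
              "VisB protein, Streptomyces virginiae "),  # note the trailing space
  stringsAsFactors = FALSE
)

# Exact matching: the trailing space hides the second record.
exact <- SCR$MESH_HEADING %in% matchedSCR$Heading

# Matching after trimming leading/trailing whitespace on both sides.
trimmed <- trimws(SCR$MESH_HEADING) %in% trimws(matchedSCR$Heading)

# Headings in matchedSCR with no exact counterpart in SCR, listed with
# their character counts so that invisible characters stand out.
orphans <- setdiff(matchedSCR$Heading, SCR$MESH_HEADING)
print(data.frame(heading = orphans, nchar = nchar(orphans)))

# Rows of SCR that remain unmatched once whitespace is ignored.
unmatched <- SCR[!trimmed, , drop = FALSE]
```

If the trimmed comparison recovers the missing records, whitespace is the culprit. If it does not, the orphans listing at least shows which strings to inspect for other differences such as encoding.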