I have two data frames, SCR and SpecificSpecies. The entries in SCR$MESH_HEADINGS contain, in part, the species listed in SpecificSpecies$Species.
SpecificSpecies$Species
S cerevisiae
Daucus carota
SCR$MESH_HEADINGS
tetracycline CMT-3
zrg17 protein, S cerevisiae
EP4 glycoprotein, Daucus carota
I am trying to get the subset of SCR containing only those entries that do not match any species. In the above case, that subset would be just
tetracycline CMT-3.
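For reproducibility, the sample above can be rebuilt as two small data frames. This is only a minimal sketch: it includes just the columns shown here (the real tables presumably have more), and `expected` is a hypothetical name for the desired result.

```r
# Toy versions of the two data frames (only the columns shown above)
SpecificSpecies <- data.frame(
  Species = c("S cerevisiae", "Daucus carota"),
  stringsAsFactors = FALSE
)
SCR <- data.frame(
  MESH_HEADINGS = c("tetracycline CMT-3",
                    "zrg17 protein, S cerevisiae",
                    "EP4 glycoprotein, Daucus carota"),
  stringsAsFactors = FALSE
)

# Desired result: the rows of SCR that mention none of the species
expected <- SCR[1, , drop = FALSE]
```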
The way I learned to do this would be using nested loops, comparing every entry in SCR to every entry in SpecificSpecies. When no match is found, append the row of SCR to a new table:
speciesNoMatch <- SCR[0, , drop = FALSE]  # empty table with the same columns as SCR
for (row in seq_len(nrow(SCR))) {
  SpeciesNumber <- 1
  match <- NULL
  # try each species in turn until one matches or the list runs out
  while (is.null(match) && SpeciesNumber <= length(SpecificSpecies$Species)) {
    if (grepl(SpecificSpecies$Species[SpeciesNumber], SCR$MESH_HEADINGS[row])) {
      match <- TRUE
    }
    SpeciesNumber <- SpeciesNumber + 1
  }
  # no species matched this entry, so keep the row
  if (is.null(match)) {
    speciesNoMatch <- rbind(speciesNoMatch, SCR[row, , drop = FALSE])
  }
}
But this is excruciatingly slow with 65,000 entries in SCR and about 1500 in SpecificSpecies. Is there a way to nest like this with lapply? Or some other function that will help here that I am unfamiliar with?
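One direction that might avoid the row-by-row inner loop, offered as a sketch on toy data rather than something verified on the full tables: `grepl()` is vectorized over the text it searches, so a single loop over the species can test the whole MESH_HEADINGS column at once. The sketch assumes the species names should be matched as literal text (hence `fixed = TRUE`, which is a choice made here, not something the original loop does), and `matched` is a hypothetical helper name.

```r
# Toy data standing in for the real tables (assumed column names as above)
SpecificSpecies <- data.frame(
  Species = c("S cerevisiae", "Daucus carota"),
  stringsAsFactors = FALSE
)
SCR <- data.frame(
  MESH_HEADINGS = c("tetracycline CMT-3",
                    "zrg17 protein, S cerevisiae",
                    "EP4 glycoprotein, Daucus carota"),
  stringsAsFactors = FALSE
)

# One flag per row of SCR: has any species been found in it so far?
matched <- rep(FALSE, nrow(SCR))

# Loop over the species only; each grepl() call scans every heading at once.
# fixed = TRUE treats the species name as plain text, not a regular expression.
for (sp in as.character(SpecificSpecies$Species)) {
  matched <- matched | grepl(sp, SCR$MESH_HEADINGS, fixed = TRUE)
}

# Keep the rows in which no species was found
speciesNoMatch <- SCR[!matched, , drop = FALSE]
```

This still performs one pass per species, but each pass is a single vectorized call, and there is no repeated `rbind()` growing the result one row at a time.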
I'm sure this is terrible code to begin with. I'm a medical librarian who has to use R sometimes for data analysis, so I have very limited programming skills to make do, but usually it doesn't matter if my solutions are ugly or inefficient as long as they eventually work. I know there must be a better way to do this; forgive me for being ignorant of something that is probably a simple solution.