I have two data frames, SCR and SpecificSpecies. The names of the items in SCR contain, in part, the species listed in SpecificSpecies.

SpecificSpecies$Species
S cerevisiae
Daucus carota

SCR$MESH_HEADINGS
tetracycline CMT-3 
zrg17 protein, S cerevisiae
EP4 glycoprotein, Daucus carota

I am trying to get the subset of SCR that contains only those entries that do not match any species. In the above case, that list would be just

tetracycline CMT-3.

The way I learned to do this would be using nested loops, comparing every entry in SCR to every entry in SpecificSpecies. When no match is found, append the row of SCR to a new table:

speciesNoMatch <- SCR[0, , drop = FALSE]  # empty table with SCR's columns
for (row in seq_len(nrow(SCR))) {
  SpeciesNumber <- 1
  match <- FALSE
  # Try each species until one matches or the list runs out
  while (!match && SpeciesNumber <= length(SpecificSpecies$Species)) {
    if (grepl(SpecificSpecies$Species[SpeciesNumber], SCR$MESH_HEADINGS[row])) {
      match <- TRUE
    }
    SpeciesNumber <- SpeciesNumber + 1
  }
  # No species matched, so append the whole row
  if (!match) {
    speciesNoMatch <- rbind(speciesNoMatch, SCR[row, , drop = FALSE])
  }
}

But this is excruciatingly slow with 65,000 entries in SCR and about 1500 in SpecificSpecies. Is there a way to nest like this with lapply? Or some other function that will help here that I am unfamiliar with?

I'm sure this is terrible code to begin with. I'm a medical librarian who has to use R sometimes for data analysis, so I have very limited programming skills to make do, but usually it doesn't matter if my solutions are ugly or inefficient as long as they eventually work. I know there must be a better way to do this; forgive me for being ignorant of something that is probably a simple solution.

NotMyJob

2 Answers

I think !(%in%) will do the trick:

SpecificSpecies <- data.frame(
  Species = c("S cerevisiae", "Daucus carota"),
  stringsAsFactors = FALSE
)

SCR <- data.frame(
  MESH_HEADINGS = c("tetracycline CMT-3", "zrg17 protein", "S cerevisiae", 
                    "EP4 glycoprotein", "Daucus carota"),
  stringsAsFactors = FALSE
)


SCR[!(SCR$MESH_HEADINGS %in% SpecificSpecies$Species), , drop = FALSE]
#        MESH_HEADINGS
# 1 tetracycline CMT-3
# 2      zrg17 protein
# 4   EP4 glycoprotein

The , , drop = ... isn't a typo. The empty slot between the two commas means all columns/variables are returned. The drop = FALSE after the second comma ensures the returned result is still a data frame, even when only one column is left.
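To see what drop = FALSE changes, here is a minimal sketch on a one-column data frame. The name df and its contents are made up for illustration; only the subsetting behaviour matters.

```r
# Hypothetical one-column data frame, similar in shape to SCR above
df <- data.frame(
  MESH_HEADINGS = c("tetracycline CMT-3", "zrg17 protein"),
  stringsAsFactors = FALSE
)

# Default: with only one column left, R simplifies the result to a vector
dropped <- df[1, ]

# drop = FALSE: the result stays a one-row data frame
kept <- df[1, , drop = FALSE]

is.data.frame(dropped)
is.data.frame(kept)
```

This only matters when the data frame has a single column; with several columns, a row subset stays a data frame either way.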

Correction

Ok, I've just noticed you're looking to grep with the Species. The following code should work:

SpecificSpecies <- data.frame(
  Species = c("S cerevisiae", "Daucus carota"),
  stringsAsFactors = FALSE
)

SCR <- data.frame(
  MESH_HEADINGS = c("tetracycline CMT-3",
                    "zrg17 protein, S cerevisiae", 
                    "EP4 glycoprotein, Daucus carota"),
  stringsAsFactors = FALSE
)

matching <- lapply(SpecificSpecies$Species, function(x) {
  grep(x, SCR$MESH_HEADINGS)
})

# Keep the rows whose index never appears among the matches
keep <- setdiff(seq_len(nrow(SCR)), unlist(matching))
SCR[keep, , drop = FALSE]
#        MESH_HEADINGS
# 1 tetracycline CMT-3

The lapply() uses an anonymous function to identify pattern matches. It loops through every species and compares it to every SCR$MESH_HEADINGS item, returning a list of matched indices.

The subset ([]) then keeps every row whose index is not among the matches. unlist() flattens the list of matched indices into a plain vector, and setdiff() is used rather than negative indexing because SCR[-integer(0), ] returns no rows at all when nothing matches.
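A logical mask is another way to do the same subsetting. The sketch below is an assumption-laden variant, not a drop-in for your data: the ID column is invented to show that entire rows come back, and fixed = TRUE assumes the species names should be matched as literal text rather than as regular expressions.

```r
SpecificSpecies <- data.frame(
  Species = c("S cerevisiae", "Daucus carota"),
  stringsAsFactors = FALSE
)

# ID is a hypothetical extra column, added to show that whole rows are kept
SCR <- data.frame(
  ID = 1:3,
  MESH_HEADINGS = c("tetracycline CMT-3",
                    "zrg17 protein, S cerevisiae",
                    "EP4 glycoprotein, Daucus carota"),
  stringsAsFactors = FALSE
)

# One logical vector per species: TRUE where that species occurs in the heading.
# fixed = TRUE treats the species name as literal text (an assumption here).
hits <- lapply(SpecificSpecies$Species, function(x) {
  grepl(x, SCR$MESH_HEADINGS, fixed = TRUE)
})

# OR the vectors together, starting from all FALSE so an empty
# species list still yields a mask of the right length
has_species <- Reduce(`|`, hits, rep(FALSE, nrow(SCR)))

# Rows where no species matched, with every column retained
no_match <- SCR[!has_species, , drop = FALSE]
no_match
```

Because the mask always has one entry per row of SCR, the case where no species matches anything simply returns all of SCR.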

Phil
  • No, I'm afraid not. Those commas are not delimiters, they are part of the original string. That is, some of the entries take the form "Protein name, species name." I need to use grep or regex or something like it to provide the matches, since the species will only match part of the SCR$MESH_HEADING. – NotMyJob Oct 24 '16 at 16:56
  • Thank you for your help with this! Two things. First, your example works for me, but fails when I try to use my actual data. I'm getting a list of 0s and then an empty table on the unlist. Any idea why? Second, I need the entire record from SCR when I subset, not just the MESH_HEADING. Is there an easy way to make this return the entire row? – NotMyJob Oct 24 '16 at 18:50
  • It should do these things already. Can you post a subset of your data frames, say the first 20 rows or so, by editing your question. Use `dput()`. See this post for help: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Phil Oct 24 '16 at 19:54
  • @NotMyJob Sorry I forgot to @ mention you: can you add some example data (see my previous comment). We'll get it working! – Phil Oct 26 '16 at 12:34

Main idea:

Loop over SpecificSpecies, since it has fewer rows. Because SCR shrinks as matching rows are removed, filter it repeatedly so that each pass of the loop works on less data.

In general, the data.table or plyr packages can improve performance. Here is a solution using data.table:

library(data.table)

SpecificSpecies <- data.frame(Species = c("S cerevisiae", "Daucus carota"),
                              stringsAsFactors = FALSE)
SCR <- data.frame(MESH_HEADINGS = c("tetracycline CMT-3",
                                    "zrg17 protein, S cerevisiae",
                                    "EP4 glycoprotein Daucus carota"),
                  stringsAsFactors = FALSE)

# Drop the rows matching each species in turn, so later passes scan fewer rows
dt_temp <- data.table(SCR)
for (species in SpecificSpecies$Species) {
  dt_temp <- dt_temp[!grepl(species, dt_temp$MESH_HEADINGS), ]
}
dt_result <- dt_temp
dt_result
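If you would rather not add a package, the same shrinking loop can be sketched in base R. The helper name drop_matching and its arguments are hypothetical, and fixed = TRUE assumes the species names are plain text rather than regular expressions.

```r
# Hypothetical helper: drop rows of `df` whose `column` contains any of `patterns`
drop_matching <- function(df, column, patterns) {
  remaining <- df
  for (p in patterns) {
    if (nrow(remaining) == 0) break  # nothing left to filter
    # TRUE for rows that do NOT contain this pattern
    keep <- !grepl(p, remaining[[column]], fixed = TRUE)
    remaining <- remaining[keep, , drop = FALSE]
  }
  remaining
}

SpecificSpecies <- data.frame(Species = c("S cerevisiae", "Daucus carota"),
                              stringsAsFactors = FALSE)
SCR <- data.frame(MESH_HEADINGS = c("tetracycline CMT-3",
                                    "zrg17 protein, S cerevisiae",
                                    "EP4 glycoprotein, Daucus carota"),
                  stringsAsFactors = FALSE)

result <- drop_matching(SCR, "MESH_HEADINGS", SpecificSpecies$Species)
result
```

Each pass runs one vectorised grepl() over the rows that are still left, so the work is one call per species instead of one call per row and species pair.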
timat