R: Search separate words in entire data frame to create a subset

Question

I have a data.frame with 11717 obs. of 15 variables. See below:

$ SCC                : Factor w/ 11717 levels "10100101","10100102",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Data.Category      : Factor w/ 6 levels "Biogenic","Event",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Short.Name         : Factor w/ 11238 levels "","2,4-D Salts and Esters Prod /Process Vents, 2,4-D Recovery: Filtration",..: 3283 3284 3293 3291 3290 3294 3295 3296 3292 3289 ...
$ EI.Sector          : Factor w/ 59 levels "Agriculture - Crops & Livestock Dust",..: 18 18 18 18 18 18 18 18 18 18 ...
$ Option.Group       : Factor w/ 25 levels "","C/I Kerosene",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Option.Set         : Factor w/ 18 levels "","A","B","B1A",..: 1 1 1 1 1 1 1 1 1 1 ...
$ SCC.Level.One      : Factor w/ 17 levels "Brick Kilns",..: 3 3 3 3 3 3 3 3 3 3 ...
$ SCC.Level.Two      : Factor w/ 146 levels "","Agricultural Chemicals Production",..: 32 32 32 32 32 32 32 32 32 32 ...
$ SCC.Level.Three    : Factor w/ 1061 levels "","100% Biosolids (e.g., sewage sludge, manure, mixtures of these matls)",..: 88 88 156 156 156 156 156 156 156 156 ...
$ SCC.Level.Four     : Factor w/ 6084 levels "","(NH4)2 SO4 Acid Bath System and Evaporator",..: 4455 5583 4466 4458 1341 5246 5584 5983 4461 776 ...
$ Map.To             : num  NA NA NA NA NA NA NA NA NA NA ...
$ Last.Inventory.Year: int  NA NA NA NA NA NA NA NA NA NA ...
$ Created_Date       : Factor w/ 57 levels "","1/27/2000 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Revised_Date       : Factor w/ 44 levels "","1/27/2000 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Usage.Notes        : Factor w/ 21 levels ""," ","includes bleaching towers, washer hoods, filtrate tanks, vacuum pump exhausts",..: 1 1 1 1 1 1 1 1 1 1 ...

I am trying to make a search for the words "Combustion" and "Coal" and create a subset showing only where "Combustion" and "Coal" are combined in the same sentence OR the same row anywhere in the data.frame:

example of the words used in same sentence:

Fuel Comb - Electric Generation - Coal.

example of the words used in same row / different columns:

see screenshot (I don't have enough creds to attach a img). [screenshot][1]

Using RStudio search shows: 675 results for "Comb" and 251 results for "Coal". So the final combination should be equal or less than 251 if I'm correct.

I tried using grep and grepl. However the only way for me to use these functions is to repeat the process across each column before creating the subset (using match function for instance).

I find this to be a time consuming process. Would you have a better one?

[1]: https://i.stack.imgur.com/YJr5B.png

You should provide a reproducible example that people can copy and paste into their R session — talat, Feb 07 '18 at 20:14
I have no idea how to create a reproducible example. I'm also afraid that using simple vectors c(xxx,yyy,zzz) might not be helpful here as we're dealing with different classes. I'm happy to share where to download the data: https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2FNEI_data.zip — euclideans, Feb 07 '18 at 20:29
Check this out: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — talat, Feb 07 '18 at 20:30

score 0 · Accepted Answer · answered Feb 07 '18 at 20:34

0

I am assuming that by "same sentence" you mean that the words your are searching for are within the same string in one column?

If so, given how I read your description of the problem, I am also assuming you just want to subset the data frame by the rows that contain both words whether they appear in the same sentence or in different columns of the same row. In either case it appears you just want to extract those rows that have both words.

If so, then one way you can do this is to concatenate all the columns together per row into one long string/sentence per row of the data frame and then grepl for your key words in the longer string (using one grepl per word/phrase). This worked for me on ~100k rows quickly (although I reduced the number of columns):

df <- data.frame(A=c("Comb","Comb Fuel","Comb",rep("None",1e5)),B=c("Fuel","Gas","None",rep("None",1e5)))
dfp <- do.call(paste,df)
df[grepl("Comb",dfp) & grepl("Fuel",dfp),]

answered Feb 07 '18 at 20:34

Daniel

1,291
6
15

1) Will it work regardless of how comb and coal are written? coal is written with the uppercase C and the lowercase C. 2) Can we use "|" (OR) in the vector definition c()? – euclideans Feb 07 '18 at 21:03
1) By default grepl is case sensitive, but you can modify it to not be case sensitive (e.g., grepl(...,ignore.case=T)). 2) Not sure what you mean. Do you mean could you rewrite the regex in grepl as one statement that ORs (|) the different terms? – Daniel Feb 07 '18 at 22:12
yes, instead of writing c("Comb",x,y,z): c("Comb|Combustion|Fuel"). Though I know the expression | can't be written in a vector c(). So the question is, is there a way to do this - with the purpose of saving space, writing the equation faster. – euclideans Feb 07 '18 at 22:44
Btw, your code above works... except it also picked up 'charcoal' in one instance out of 104. Is there a way to exclude in the code you wrote? I could of course subset it using the '-' symbol, but I don't find it really elegant. – euclideans Feb 07 '18 at 22:56
For the first part, just ORing them won't work since it won't guarantee that all the words appear in the row. If you need to extend it to an arbitrary set of words, you could write something along the lines of apply(do.call(rbind,lapply(c("word1","word2",...),function(w) grepl(w,dfp,ignore.case=T))),1,all). The second part can be accomplished by modifying the regex (e.g., "coal" to "\bcoal\b" would require that coal is a word by itself). – Daniel Feb 08 '18 at 04:05

R: Search separate words in entire data frame to create a subset

1 Answers1