Automated grep() across multiple columns in large dataset in R

Question

EDIT Reproducible example at the bottom...

I am working with a large dataset (pooled NHAMCS from the CDC):

> dim(ed0509)
[1] 174020    514

I'm having trouble using grep() to identify rows in a data frame where any of the columns DIAG1, DIAG2 or DIAG3 matches a pattern from a vector of codes of interest, SSTI.list. If a pattern is found in any one of these columns, I want to pull out that row number and ultimately use it to create a new categorical column SSTI.cat in the dataset (0 or 1).

SSTI.list <- c("035", "566", "60883", "6110", "6752", "6751", "680","681","682","683","684","684","685","686", "7048", "70583","7070", "7078", "7079", "7071", "7280", "72886", "7714", "7715", "7854", "9583", "99662", "99762", "9985")

Since I am dealing with a pretty long list (1000s of elements), I'm trying to automate this process using a for loop. The desired output is a set of new variables, each containing the row numbers matched by one value in the vector SSTI.list. I am mainly having issues running a for loop with grep(), and I get the error:

argument 'pattern' has length > 1 and only the first element will be used

What I have tried to do so far is:

diags <- c(ed0509$DIAG1,ed0509$DIAG2,ed0509$DIAG3)

for (i in SSTI.list){ assign(paste("var",i,sep=""),grep(paste("^",i,"",sep=""),diags,value=F)) }

SSTI.comb would be the final list of rows (the combination of all the var<i> vectors created by the for loop) that matched the patterns in SSTI.list, and it would be used to create the categorical variable SSTI.cat.

I then used the data.table package to create the categorical variable:

SSTI.comb<-sort(as.numeric(SSTI.comb))

setDT(ed0509)[,SSTI.cat:=0][SSTI.comb,SSTI.cat:=1]  # set the default 0 first, then flag matched rows

EDIT for reproducibility, sorry about that...

DIAG1=c("00000","4659-","0356-","5664-","771--","7715-","78791")
DIAG2=c("3829-","00000","00000","4659-","7854-","00000","566--")
DIAG3=c("9985-","00000","00000","00000","00000","00000","00000")
df<-data.frame(DIAG1,DIAG2,DIAG3)

SSTI.list <- c("035","9985","7854","771","7715")

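# Stack the three columns into one character vector for this example
diags <- c(as.character(df$DIAG1), as.character(df$DIAG2), as.character(df$DIAG3))

for (i in SSTI.list){
  assign(paste("var",i,sep=""),grep(paste("^",i,"",sep=""),diags,value=F))
}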

Conceptually, I would like the new column attached to df to indicate that the 1st, 3rd, 5th and 6th rows match a pattern in SSTI.list:

  DIAG1 DIAG2 DIAG3 SSTI.cat
1 00000 3829- 9985-        1
2 4659- 00000 00000        0
3 0356- 00000 00000        1
4 5664- 4659- 00000        0
5 771-- 7854- 00000        1
6 7715- 00000 00000        1
7 78791 566-- 00000        0

Can you make the question [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) by adding an example of your data set and of your expected output dataset? — thepule, Aug 13 '16 at 19:40
Do you need an element in `SSTI.list` to match exactly with an element in the data, or does it only have to match part of the value? For example, if an element in the data is `"683035"`, should that result in a match with `"035"` from `SSTI.list`, or should only `"035"` in the data match `"035"` in `SSTI.list`? — eipi10, Aug 13 '16 at 20:15
@thepule I added a reproducible example and the desired output. @eipi10 I used the `^` in `grep()` to indicate that I want the element matched from the beginning of the string, so `"035"` from `SSTI.list` would identify `"0351-"` or `"03568"` but not `"683035"`. — EJJ, Aug 13 '16 at 20:28

eipi10 · Accepted Answer · 2016-08-13T20:35:08.723

Here's an example with fake data that I cooked up before you added your data. Let me know if this is what you had in mind:

SSTI.list <- c("035", "566", "60883", "6110", "6752", "6751", "680","681","682","683","684","684",
               "685","686", "7048", "70583","7070", "7078", "7079", "7071", "7280", "72886", 
               "7714", "7715", "7854", "9583", "99662", "99762", "9985")

# Fake data
set.seed(10)
dat = as.data.frame(replicate(5, sample(c(SSTI.list, 1e5:(1e5+1000)),10)), stringsAsFactors=FALSE)

       V1     V2     V3     V4     V5
1  100493 100642 100861 100522 100254
2  100286 100555 100604 100066 100206
3  100409 100087 100767 100145   7048
4  100682 100583 100336 100895 100719
5  100058 100338 100387 100404 100227
6  100202 100410 100695 100737 100136
7  100252 100024 100829 100813   7078
8  100249 100241 100216 100947 100468
9  100600 100378 100758 100671 100076
10 100998 100824 100334 100482 100789

# Match any instance of a pattern within any element of the data
dat[apply(dat, 1, function(i) any(grepl(paste(SSTI.list, collapse="|"), i))),]

      V1     V2     V3     V4     V5
3 100409 100087 100767 100145   7048
4 100682 100583 100336 100895 100719  # "100682" matches "682" in SSTI.list
7 100252 100024 100829 100813   7078

# Match only if a data element is exactly the same as one of the patterns.
dat[apply(dat, 1, function(i) any(grepl(paste(paste0("^",SSTI.list,"$"), collapse="|"), i))),]

      V1     V2     V3     V4   V5
3 100409 100087 100767 100145 7048
7 100252 100024 100829 100813 7078
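
For the exact-match case, a regular expression may not be needed at all. The following is a minimal base R sketch, assuming the codes are stored as character strings; the toy `dat` and `codes` below are made up for illustration and are not the fake data above.

```r
# Hypothetical toy data (not the fake data generated above)
codes <- c("7048", "7078", "682")
dat <- data.frame(V1 = c("100409", "100682", "100252"),
                  V2 = c("7048", "100719", "7078"),
                  stringsAsFactors = FALSE)

# %in% tests whole-value equality, so "100682" does not match "682".
# sapply() returns a logical matrix with one column per data column.
exact <- rowSums(sapply(dat, function(col) col %in% codes)) > 0

dat[exact, ]   # rows where at least one value equals one of the codes
which(exact)   # the matching row indices
```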

If you just want the row indices of matching rows:

which(apply(dat, 1, function(i) any(grepl(paste(SSTI.list, collapse="|"), i))))

[1] 3 4 7
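
Based on the clarification in the comments (a code should match only from the start of the string), the same idea can be sketched on the reproducible example from the question. This is an illustration rather than the definitive approach; `prefix.pattern` and `hits` are names invented here, and it assumes the codes contain only digits, so no regex escaping is needed.

```r
# Reproducible data from the question
df <- data.frame(
  DIAG1 = c("00000", "4659-", "0356-", "5664-", "771--", "7715-", "78791"),
  DIAG2 = c("3829-", "00000", "00000", "4659-", "7854-", "00000", "566--"),
  DIAG3 = c("9985-", "00000", "00000", "00000", "00000", "00000", "00000"),
  stringsAsFactors = FALSE
)
SSTI.list <- c("035", "9985", "7854", "771", "7715")

# Build a single anchored pattern: "^(035|9985|7854|771|7715)"
prefix.pattern <- paste0("^(", paste(SSTI.list, collapse = "|"), ")")

# grepl() is vectorised over each column; sapply() gives a logical
# matrix with one row per data row and one column per DIAG column
hits <- sapply(df[c("DIAG1", "DIAG2", "DIAG3")], grepl, pattern = prefix.pattern)

# Flag a row with 1 if any of its DIAG columns matched, otherwise 0
df$SSTI.cat <- as.integer(rowSums(hits) > 0)
df
```

Working column by column keeps the number of `grepl()` calls equal to the number of columns rather than the number of rows, which may help on a data set with roughly 174,000 rows.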

Yes, this is what I had wanted! I'm new to programming in R and I really appreciate your help with this. I was wondering if this is the only way to do this or if there are any existing packages that would do the equivalent? — EJJ, Aug 13 '16 at 20:45
There are almost always multiple ways to do things in R, and I'd be surprised if my answer is the most efficient. It's possible the `stringr`, `stringi`, or `data.table` packages have something simpler and/or faster, and there might be a better way in base R as well. Hopefully, someone will come along with other options. — eipi10, Aug 13 '16 at 20:48
Interesting, I'll look into those packages. I have another question: can `dat` in `apply()` be replaced with a vector that identifies specific columns in `dat`? In other words, what if I only wanted to consider `DIAG1` and `DIAG3`, or just `DIAG2`? — EJJ, Aug 13 '16 at 21:05
Then do `dat[ , c("DIAG1", "DIAG3", "DIAG5")]`. Or, more succinctly, `dat[ , paste0("DIAG", c(1,3,5))]`. Or, `dat[ , grep("1|3|5", names(dat))]` (note, the latter will also match "DIAG11" or "DIAG35", so you have to be careful about your pattern). Like I said, many ways to do things in R! — eipi10, Aug 13 '16 at 21:14
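
The caveat about column-name patterns in the last comment can be illustrated with a short sketch. The vector `col.names` is hypothetical and stands in for `names(dat)`; anchoring the pattern with `^` and `$` is one possible way to avoid the unintended matches.

```r
# Hypothetical column names standing in for names(dat)
col.names <- c("DIAG1", "DIAG2", "DIAG3", "DIAG11", "DIAG35")

# Loose pattern: matches any name containing a 1, 3 or 5
loose <- grep("1|3|5", col.names, value = TRUE)

# Anchored pattern: matches only DIAG1, DIAG3 or DIAG5 exactly
anchored <- grep("^DIAG(1|3|5)$", col.names, value = TRUE)

loose
anchored
```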
