EDIT Reproducible example at the bottom...
I am working with a large dataset (pooled NHAMCS from the CDC):
> dim(ed0509)
[1] 174020 514
I'm having trouble using grep()
to identify rows in a data frame based on patterns in multiple column variables DIAG1
DIAG2
DIAG3
based on a vector list of interest SSTI.list
. The condition is that if this pattern is identified in either one of column variables, then I want to pull that row number out to ultimately use this to subset the data to create a new categorical column SSTI.cat
in the dataset (0 or 1).
SSTI.list <- c("035", "566", "60883", "6110", "6752", "6751", "680","681","682","683","684","684","685","686", "7048", "70583","7070", "7078", "7079", "7071", "7280", "72886", "7714", "7715", "7854", "9583", "99662", "99762", "9985")
Since I am dealing with a pretty long list >1000s of elements, I'm trying to automate this process using a for loop. The desired output is having new variables that contain the list of rows for each value in the vector SSTI.list
. I have mainly having issues running a for loop within grep()
and I get the error:
argument 'pattern' has length > 1 and only the first element will be used
What I have tried to do so far is:
diags <- c(ed0509$DIAG1,ed0509$DIAG2,ed0509$DIAG3)
for (i in SSTI.list){
assign(paste("var",i,sep=""),grep(paste("^",i,"",sep=""),diags,value=F))
}
SSTI.comb
would be the final list of rows (all of var
i) that identified the patterns in SSTI.list
from the for loop that would be used to create the categorical variable SSTI.cat
Then used the data.table
package to create the categorical variable.
SSTI.comb<-sort(as.numeric(SSTI.comb))
setDT(ed0509)[SSTI.comb,SSTI.cat:=1][,SSTI.cat:=0]
EDIT for reproducibility, sorry about that...
DIAG1=c("00000","4659-","0356-","5664-","771--","7715-","78791")
DIAG2=c("3829-","00000","00000","4659-","7854-","00000","566--")
DIAG3=c("9985-","00000","00000","00000","00000","00000","00000")
df<-data.frame(DIAG1,DIAG2,DIAG3)`
SSTI.list <- c("035","9985","7854","771","7715")
for (i in SSTI.list){
assign(paste("var",i,sep=""),grep(paste("^",i,"",sep=""),diags,value=F))
}
Conceptually I would like to have an output where the new column variable attached to df
would indicate that the 1st, 3rd, 5th and 6th rows are identified to satisfy the pattern indicated in SSTI.list
DIAG1 DIAG2 DIAG3 SSTI.cat
1 00000 3829- 9985- 1
2 4659- 00000 00000 0
3 0356- 00000 00000 1
4 5664- 4659- 00000 0
5 771-- 7854- 00000 1
6 7715- 00000 00000 1
7 78791 566-- 00000 0