Looping grepl() through data.table (R)

Question

I have a dataset stored as a data.table DT that looks like this:

print(DT)
             category industry
1:     administration    admin
2: nurse practitioner    truck
3:           trucking    truck
4:     administration    admin
5:        warehousing    nurse
6:        warehousing    admin
7:           trucking    truck
8: nurse practitioner    nurse
9: nurse practitioner    truck

I would like to reduce the table to only the rows where the industry matches the category. My general approach is to use grepl() to match each row of DT$category against the regex '^{{IND}}[a-z ]+$', where infuse() (from the infuser package) substitutes that row's DT$industry for {{IND}}. I struggled to find a sleek data.table solution that would loop through the table and make within-row comparisons, so I resorted to a for-loop to get the job done:

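library(data.table)
library(infuser)  # provides infuse() for filling the {{IND}} placeholder

template <- "^{{IND}}[a-z ]+$"
DT[, match := FALSE]
for (i in seq_len(nrow(DT))) {
    ind <- DT$industry[i]
    categ <- DT$category[i]
    # flag the row when its category starts with its industry
    if (grepl(infuse(template, IND = ind), categ)) {
        set(DT, i, "match", TRUE)
    }
}
# keep the matching rows and drop the helper column
DT <- DT[match == TRUE][, match := NULL]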
print(DT)
             category industry
1:     administration    admin
2:           trucking    truck
3:     administration    admin
4:           trucking    truck
5: nurse practitioner    nurse

However, I am sure this can be done in a better way. Any suggestions for how I could achieve this result by utilizing the data.table package's functionality? It's my understanding that, in this context, an approach that uses the package would likely be more efficient than a for-loop.

The parameters of your true use case are not clear, but I imagine `by=industry` or `by=.(industry, category)` might help (in either of the answers below), reducing the number of comparisons needed. — Frank, Nov 13 '15 at 18:43
what @Frank said - do a regular `grep` by industry - I'm pretty sure that'll be much faster than the `stringr` answer (and is obviously more general than the substring one) - `dt[dt[, grepl(industry, category), by = industry]$V1]` — eddi, Nov 13 '15 at 19:19
@eddi The `by=` might reorder the data, making the logical subsetting wrong. Maybe `DT[ DT[, .I[grep(industry, category)], by = industry]$V1 ]` from http://stackoverflow.com/a/16574176/1191259 — Frank, Nov 13 '15 at 19:28
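
The reordering concern raised in these comments can be checked on the sample data. The following is a minimal sketch, not part of the original thread: the data is rebuilt inline so the snippet is self-contained, and the names rowwise, grouped and idx are illustrative only. The logical vector returned with by = industry comes back group by group, so it no longer lines up with the rows of DT, whereas .I carries the original row numbers along.

```r
library(data.table)

DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing", "trucking",
               "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse", "admin",
               "truck", "nurse", "truck")
)

# Row-by-row reference: does each category contain its own industry?
rowwise <- unname(mapply(grepl, DT$industry, DT$category, fixed = TRUE))

# Grouped grepl(): results come back group by group (admin, truck, nurse),
# not in the original row order.
grouped <- DT[, grepl(industry, category, fixed = TRUE), by = industry]$V1

identical(grouped, rowwise)  # FALSE here, so DT[grouped] would pick wrong rows

# .I holds the original row numbers, so this result is safe for subsetting.
idx <- DT[, .I[grepl(industry, category, fixed = TRUE)], by = industry]$V1
DT[sort(idx)]  # sort() restores the original row order
```

Without sort(), the rows are returned in group order rather than in the order they appear in DT.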

Rich Scriven · Answer 1 · 2015-11-13T22:14:37.393

You could use stringi::stri_detect_fixed(). It is vectorized over both str and pattern.

DT[stringi::stri_detect_fixed(category, industry)]
#              category industry
# 1:     administration    admin
# 2:           trucking    truck
# 3:     administration    admin
# 4:           trucking    truck
# 5: nurse practitioner    nurse

Alternatively, stringr::str_detect() can be used. It is also vectorized over both its arguments.

library(stringr)
DT[str_detect(category, fixed(industry))]

A base R option is to run grepl() through mapply():

DT[mapply(grepl, industry, category, fixed = TRUE)]

Or you could get the same result with Vectorize(grepl).

DT[Vectorize(grepl)(industry, category, fixed = TRUE)]

All of these produce the same result.
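
As a quick sanity check of that claim, here is a small sketch that compares the base R variants on the sample vectors, and stringi as well if it happens to be installed. The via_* names are illustrative and not from the answer.

```r
category <- c("administration", "nurse practitioner", "trucking",
              "administration", "warehousing", "warehousing", "trucking",
              "nurse practitioner", "nurse practitioner")
industry <- c("admin", "truck", "truck", "admin", "nurse", "admin",
              "truck", "nurse", "truck")

# mapply() and Vectorize() both return a named vector; drop the names
# so the results can be compared with identical().
via_mapply    <- unname(mapply(grepl, industry, category,
                               MoreArgs = list(fixed = TRUE)))
via_vectorize <- unname(Vectorize(grepl)(industry, category, fixed = TRUE))

identical(via_mapply, via_vectorize)  # TRUE
which(via_mapply)                     # 1 3 4 7 8

# Optional: only runs when stringi is available.
if (requireNamespace("stringi", quietly = TRUE)) {
  via_stringi <- stringi::stri_detect_fixed(category, industry)
  stopifnot(identical(via_stringi, via_mapply))
}
```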

Data:

DT <- structure(list(category = c("administration", "nurse practitioner", 
"trucking", "administration", "warehousing", "warehousing", "trucking", 
"nurse practitioner", "nurse practitioner"), industry = c("admin", 
"truck", "truck", "admin", "nurse", "admin", "truck", "nurse", 
"truck")), .Names = c("category", "industry"), class = "data.frame", row.names = c(NA, 
-9L))
setDT(DT)

Any judgment which of the two is "better"? Seems like vectorization should beat loop hiding, but I'm not sure. — Frank, Nov 13 '15 at 18:25
@Frank - I would guess that `mapply()` is better since it does fewer checks. — Rich Scriven, Nov 13 '15 at 18:27
Great use of the vectorised matching in stringi - it's perfect for this. — Ken Benoit, Nov 13 '15 at 19:23

score 7 · Answer 2 · answered Nov 13 '15 at 18:31

As long as the match is always anchored at the start of the category string, this works just fine:

DT[substring(category, 1, nchar(industry)) == industry]
#              category industry
# 1:     administration    admin
# 2:           trucking    truck
# 3:     administration    admin
# 4:           trucking    truck
# 5: nurse practitioner    nurse
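
The same prefix test can also be written with base R's startsWith(), which was added in R 3.3.0 and so postdates this answer. The sketch below is an illustration under that assumption, not the answerer's code. It adds one hypothetical row ("long-haul trucking") to show where a prefix test and a "contains" test diverge.

```r
library(data.table)

# A few sample rows plus one hypothetical row (row 5) where the industry
# occurs inside the category but not at its start.
DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "warehousing", "long-haul trucking"),
  industry = c("admin", "truck", "truck", "admin", "truck")
)

# Prefix test: startsWith() compares element by element.
DT[startsWith(category, industry)]

# The substring() comparison selects the same rows (1 and 3).
DT[substring(category, 1, nchar(industry)) == industry]

# A fixed "contains" test additionally keeps the hypothetical row 5.
DT[mapply(grepl, industry, category, MoreArgs = list(fixed = TRUE))]
```

Which behaviour is wanted depends on the real data; the question's regex is anchored with ^, which suggests the prefix form.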

score 6 · Accepted Answer · edited May 23 '17 at 12:32

data.table is good at grouped operations; I think that's how it can help, assuming you have many rows with the same industry:

DT[ DT[, .I[grep(industry, category)], by = industry]$V1 ]

This uses the current idiom for subsetting by group, thanks to @eddi.

Comments. These might help further:

If you have many rows with the same industry-category combo, try by=.(industry,category).
Try something else in place of grep (like the options in Ken's and Richard's answers).
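
A minimal sketch of the first suggestion, using the sample data rebuilt inline and swapping grep for a fixed-string grepl() (my substitution, in the spirit of the second suggestion). Grouping by both columns means the match is evaluated once per distinct industry-category pair; idx is an illustrative name.

```r
library(data.table)

DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing", "trucking",
               "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse", "admin",
               "truck", "nurse", "truck")
)

uniqueN(DT, by = c("industry", "category"))  # 6 distinct pairs for 9 rows

# Inside each group both columns have length 1, so grepl() returns a single
# TRUE/FALSE; .I[TRUE] keeps every row number of the group, .I[FALSE] none.
idx <- DT[, .I[grepl(industry, category, fixed = TRUE)],
          by = .(industry, category)]$V1

DT[sort(idx)]  # sort() puts the rows back in their original order
```

Whether this is faster than a plain vectorized match depends on how many rows share each pair, so it is worth timing on the real data.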
