5

I need to check if a string in one column contains a corresponding (numeric) value from the same row of another column, for all rows.

If I were only checking the string for a single pattern this would be straightforward using data.table's like or grepl. However, my pattern value is different for every row.

There's a somewhat related question here, but unlike that question I need to create a logical flag indicating if the pattern was present.

Let's say this is my dataset;

DT <- structure(list(category = c("administration", "nurse practitioner", 
                                  "trucking", "administration", "warehousing", "warehousing", "trucking", 
                                  "nurse practitioner", "nurse practitioner"), industry = c("admin", 
                                                                                            "truck", "truck", "admin", "nurse", "admin", "truck", "nurse", 
                                                                                            "truck")), .Names = c("category", "industry"), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                               -9L))
setDT(DT)
> DT
             category industry
1:     administration    admin
2: nurse practitioner    truck
3:           trucking    truck
4:     administration    admin
5:        warehousing    nurse
6:        warehousing    admin
7:           trucking    truck
8: nurse practitioner    nurse
9: nurse practitioner    truck

My desired result would be a vector like this:

> DT
   matches
1: TRUE
2: FALSE
3: TRUE
4: TRUE
5: FALSE
6: FALSE
7: TRUE
8: TRUE
9: FALSE

Of course, 1's and 0's would be just as good as TRUE and FALSE.

Here are some things I tried that didn't work:

apply(DT,1,grepl, pattern = DT[,2], x = DT[,1])
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

> apply(DT,1,grepl, pattern = DT[,1], x = DT[,2])
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

> grepl(DT[,2], DT[,1])
[1] FALSE

> DT[Vectorize(grepl)(industry, category, fixed = TRUE)]
             category industry
1:     administration    admin
2:           trucking    truck
3:     administration    admin
4:           trucking    truck
5: nurse practitioner    nurse

> DT[stringi::stri_detect_fixed(category, industry)]
             category industry
1:     administration    admin
2:           trucking    truck
3:     administration    admin
4:           trucking    truck
5: nurse practitioner    nurse

> for(i in 1:nrow(DT)){print(grepl(DT[i,2], DT[i,1]))}
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE

> for(i in 1:nrow(DT)){print(grepl(DT[i,2], DT[i,1], fixed = T))}
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE

> DT[category %like% industry]
         category industry
1: administration    admin
2: administration    admin
Warning message:
In grepl(pattern, vector) :
  argument 'pattern' has length > 1 and only the first element will be used
Community
  • 1
  • 1
Hack-R
  • 22,422
  • 14
  • 75
  • 131

3 Answers3

7

In the OP's code, the , was not used. So, based on the data.table method, it will subset the rows that corresponds to the i index.

But, if we are specifying the , we are playing with the j and we get the logical vector as a result

DT[, stri_detect_fixed(category, industry)]
#[1]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE

Suppose, we keep it in a list, then we get a data.table with a column

DT[, list(match=stri_detect_fixed(category, industry))]
akrun
  • 874,273
  • 37
  • 540
  • 662
3

Or use:

apply(DT, 1, function(x) grepl(x[2], x[1],fixed=T))
count
  • 1,328
  • 9
  • 16
  • That works too. That was what I tried to do in the first example. I wonder why the way I indexed it broke it. I guess rows were implied by the margin in apply(). – Hack-R Feb 26 '16 at 20:20
3

I normally do:

DT[, flag := grepl(industry, category, fixed = TRUE), by = industry]
eddi
  • 49,088
  • 6
  • 104
  • 155