Creating dummy variable from list in a data.table in R

Question

I have a data table like this:

> x
   part        colig
 1:   PR     PT, PMDB
 2: PMDB     PT, PMDB
 3: PMDB     PT, PMDB
 4:  PDT     PT, PMDB
 5: PMDB     PT, PMDB
 6:  PFL PSDB,PFL,PTB
 7:  PPB PSDB,PFL,PTB
 8: PMDB PSDB,PFL,PTB
 9: PMDB PSDB,PFL,PTB
10:  PPB PSDB,PFL,PTB
> str(x)
Classes ‘data.table’ and 'data.frame':  10 obs. of  2 variables:
 $ part : chr  "PR" "PMDB" "PMDB" "PDT" ...
 $ colig:List of 10
  ..$ : chr "PT" "PMDB"
  ..$ : chr "PT" "PMDB"
  ..$ : chr "PT" "PMDB"
  ..$ : chr "PT" "PMDB"
  ..$ : chr "PT" "PMDB"
  ..$ : chr  "PSDB" "PFL" "PTB"
  ..$ : chr  "PSDB" "PFL" "PTB"
  ..$ : chr  "PSDB" "PFL" "PTB"
  ..$ : chr  "PSDB" "PFL" "PTB"
  ..$ : chr  "PSDB" "PFL" "PTB"
 - attr(*, ".internal.selfref")=<externalptr>

and I want to create a dummy variable that is 1 when the first variable is contained in the second. My desired output is:

> x
    part        colig dummy
 1:   PR     PT, PMDB FALSE
 2: PMDB     PT, PMDB  TRUE
 3: PMDB     PT, PMDB  TRUE
 4:  PDT     PT, PMDB FALSE
 5: PMDB     PT, PMDB  TRUE
 6:  PFL PSDB,PFL,PTB  TRUE
 7:  PPB PSDB,PFL,PTB FALSE
 8: PMDB PSDB,PFL,PTB FALSE
 9: PMDB PSDB,PFL,PTB FALSE
10:  PPB PSDB,PFL,PTB FALSE

My problem is accessing the elements inside the list in the second column. I'm trying something like:

x[, dummy := x[,part] %in% x[, colig]]

or

x[, dummy := x[,part] %in% unlist(x[, colig])]

The two options are wrong. In the first case, the dummy is always FALSE, and in the second, the unlist() command creates a list with elements from all the lists (not only from the respective row).

I also tried with lapply (like here Creating dummy variables in R data.table):

x[, dummy := lapply( x[,part], function(y) y %in% unlist(x[,colig]))]

which I think is correct, but I am having problems with speed because I have a lot of rows.

Is there any faster option?

eddi · Answer 1 · 2015-07-30T20:56:34.340

2

Use grepl and do it by each value of "part":

x[, dummy := grepl(part, colig), by = part]

Upon second reading of OP, I'm not sure what's going on in that column - looks like some of the elements are lists and others are characters. The above will work for characters (and you can squeeze in lapply(colig, toString) somewhere to convert the list to strings).

edited Jul 30 '15 at 20:56

answered Jul 30 '15 at 20:42

eddi

49,088
6
104
155

Rorschach · Answer 2 · 2015-07-30T20:09:19.793

1

Try with stringi, it should be fast.

library(stringi)
x$dummy = stri_detect(x[,"colig"], fixed=x[,"part"])
#    part        colig dummy
# 2    PR     PT, PMDB FALSE
# 3  PMDB     PT, PMDB  TRUE
# 4  PMDB     PT, PMDB  TRUE
# 5   PDT     PT, PMDB FALSE
# 6  PMDB     PT, PMDB  TRUE
# 7   PFL PSDB,PFL,PTB  TRUE
# 8   PPB PSDB,PFL,PTB FALSE
# 9  PMDB PSDB,PFL,PTB FALSE
# 10 PMDB PSDB,PFL,PTB FALSE
# 11  PPB PSDB,PFL,PTB FALSE

or as data.table

setDT(x)[, dummy := stri_detect(colig, fixed=part)]

Edit

If you a mixture of lists and unseparated strings as it appears you might, try something like

setDT(x)[, dummy := any(stri_detect(colig, fixed=part)), by=1:nrow(x)]

edited Jul 30 '15 at 20:09

answered Jul 30 '15 at 19:48

Rorschach

31,301
5
78
129

Thanks! It worked but I'm having the same problem I described below: there are rows that part = "PT" and colig = "PTB", or part = "PP" and colig = "PPB", and I get TRUE instead of FALSE. – Bruno Mazzucco Jul 30 '15 at 20:35
In that case you can use a regex. So, instead of `fixed=part`, use `regex=` and construct a regex that works for all your cases. – Rorschach Jul 30 '15 at 20:46

A. Webb · Answer 3 · 2015-07-30T21:00:43.223

0

From your str(x) output, you seem to have some problems with your data. The first few rows of colig do not appear to be split. In other words, you probably mean to have the two elements "PT", "PMDB" rather than the single element "PT, PMDB". This may be part of the problem. Apply strsplit as necessary.

If your sample is representative, then doing simply

apply(x,1,function(x) x$part %in% x$colig)

where x is just a data.frame should be plenty fast. I replicated a corrected version of your x to 100000 rows and this ran in a fraction of a second.

edited Jul 30 '15 at 21:00

answered Jul 30 '15 at 19:58

A. Webb

26,227
1
63
95

Thanks! I corrected the problem with the strings, but I'm afraid my data were not representative. Your answer worked fine with all the cases except the ones the values in the two columns overlap. There are rows that part = "PT" and colig = "PTB", or part = "PP" and colig = "PPB", and I get TRUE instead of FALSE. – Bruno Mazzucco Jul 30 '15 at 20:32

Creating dummy variable from list in a data.table in R

3 Answers3

Edit