1

Say I have a data.table where C columns hold discrete values among N possible values:

set.seed(123)
datapoints = data.table(replicate(3, sample(0:5, 4, rep=TRUE)))
print(datapoints)
   V1 V2 V3
1:  1  5  3
2:  4  0  2
3:  2  3  5
4:  5  5  2

(here C=3 and N=5)

I want to add N columns, each containing TRUE if one of the C columns contains the Nth value, FALSE otherwise:

   V1 V2 V3  has0  has1  has2  has3  has4  has5
1:  1  5  3 FALSE  TRUE FALSE  TRUE FALSE  TRUE
2:  4  0  2  TRUE FALSE  TRUE FALSE  TRUE FALSE
3:  2  3  5 FALSE FALSE  TRUE  TRUE FALSE  TRUE
4:  5  5  2 FALSE FALSE  TRUE FALSE FALSE  TRUE

I have tried this:

for (value in 0:5) {
  datapoints <- datapoints[, (paste("has", value, sep="")) := (value %in% .SD), .SDcols = c("V1", "V2", "V3")]
}

The columns are added but filled with FALSE:

   V1 V2 V3  has0  has1  has2  has3  has4  has5
1:  1  5  3 FALSE FALSE FALSE FALSE FALSE FALSE
2:  4  0  2 FALSE FALSE FALSE FALSE FALSE FALSE
3:  2  3  5 FALSE FALSE FALSE FALSE FALSE FALSE
4:  5  5  2 FALSE FALSE FALSE FALSE FALSE FALSE

It seems to me that the code would work if I replaced .SD with a reference to the current row (instead of the whole table), but I don't know how to do that.

What is an efficient way of adding these columns?

Hey
  • 1,701
  • 5
  • 23
  • 43
  • 3
    Please post some data that we can use to replicate the error and help you better. – cropgen Apr 22 '19 at 18:28
  • I added a reproducible example. – Hey Apr 22 '19 at 18:57
  • 1
    Related: [How to one-hot-encode factor variables with data.table?](https://stackoverflow.com/questions/39905820/how-to-one-hot-encode-factor-variables-with-data-table) – Henrik Apr 22 '19 at 20:18

1 Answers1

1

Here is one way

library(data.table)

# sample data
set.seed(123)
datapoints = data.table(replicate(3, sample(0:5, 4, rep=TRUE)))

# find if value exists
for(value in 0:5) {
  datapoints[, paste("has", value, sep="") := apply(.SD, 1, function(x) any(x %in% value)), .SDcols = c("V1", "V2", "V3")]
}

datapoints
#>    V1 V2 V3  has0  has1  has2  has3  has4  has5
#> 1:  1  5  3 FALSE  TRUE FALSE  TRUE FALSE  TRUE
#> 2:  4  0  2  TRUE FALSE  TRUE FALSE  TRUE FALSE
#> 3:  2  3  5 FALSE FALSE  TRUE  TRUE FALSE  TRUE
#> 4:  5  5  2 FALSE FALSE  TRUE FALSE FALSE  TRUE

To make is more flexible you can also replace any(x %in% value) with sum(x %in% value) to get how many times that value appears per row. For the same example

# find how many times a value exists
for(value in 0:5) {
  datapoints[, paste("has", value, sep="") := apply(.SD, 1, function(x) sum(x %in% value)), .SDcols = c("V1", "V2", "V3")]
}

datapoints
#>    V1 V2 V3 has0 has1 has2 has3 has4 has5
#> 1:  1  5  3    0    1    0    1    0    1
#> 2:  4  0  2    1    0    1    0    1    0
#> 3:  2  3  5    0    0    1    1    0    1
#> 4:  5  5  2    0    0    1    0    0    2

And of course, you can still use your .SDcols if you only want a subset of columns.

cropgen
  • 1,920
  • 15
  • 24