5

I have a sparse data table that looks like this:

 data = data.table(
    var1 = c("a","",""),
    var2 = c("","","c"),
    var3 = c("a","b",""),
    var4 = c("","b","")
)
      var1 var2 var3 var4
    1:    a         a     
    2:              b    b
    3:         c     

I would like to add a column that contains a string of zeros and ones indicating which variables are present in any row, like this:

  var1 var2 var3 var4  concat
1:    a         a      1|0|1|0
2:              b    b 0|0|1|1
3:         c           0|1|0|0

I can get to this with the following command:

data[, concat := paste(
           as.integer(var1 != ""),
           as.integer(var2 != ""),
           as.integer(var3 != ""),
           as.integer(var4 != ""),
           sep = "|")]

However, if I have hundreds of variables, I would rather use some sort of computation to get to the desired expression. Perhaps something based on paste0("var",1:4), or at least a vector of column names. Any suggestions?

BigFinger
  • 1,033
  • 6
  • 7

5 Answers5

4

Same basic approach as above:

data[ , concat := apply(.SD, 1, function(x) paste(+(x == ""), collapse = "|"))][]
#    var1 var2 var3 var4  concat
# 1:    a         a      0|1|0|1
# 2:              b    b 1|1|0|0
# 3:         c           1|0|1|1
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
  • I'm really surprised that looping over each row is faster than `paste`-ing them together as columns. But it seems to be the case regardless of what I think should happen! :-P – thelatemail Aug 12 '16 at 01:55
4

Variation not requiring any grouping by each row or apply-ing on each row.

data[, concat := do.call(paste, c(lapply(.SD, function(x) (x!="")+0 ), sep="|")) ]

#   var1 var2 var3 var4  concat
#1:    a         a      1|0|1|0
#2:              b    b 0|0|1|1
#3:         c           0|1|0|0
thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • 1
    Instead of all this equality testing followed by coercion (in every answer), one could use `nchar` I guess. – Frank Aug 11 '16 at 23:45
  • 1
    @Frank - true, though that assumes that the valid values in the real data are all just one character. If not, I suppose you could do `pmin` or something to cap it, but that's just adding complexity again. – thelatemail Aug 11 '16 at 23:51
2
 data$concat <- apply(apply(data, 2, function(x) ifelse(x == "", 0, 1)), 1, function(x) paste(x, collapse="|"))

Breakdown:

1) For each column in data, check if element is empty, if so return 0, else 1

apply(data, 2, function(x) ifelse(x == "", 0, 1))

Let's call the return from (1) the variable concat. For each row of concat, paste everything together with a pipe (|) separating them. Set the new column of data to equal this.

apply(concat, 1, function(x) paste(x, collapse="|"))
TomNash
  • 3,147
  • 2
  • 21
  • 57
  • Good, but suggestions: no reason not to just do the whole thing row-wise; if going column-wise _were_ called for, the approach would be to use `lapply(.SD, "==", "")`; [`ifelse` is slow](http://stackoverflow.com/questions/16275149/does-ifelse-really-calculate-both-of-its-vectors-every-time-is-it-slow), so avoid it; and, this having the `[data.table]` tag, you should be [assigning by reference with `:=`](https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-reference-semantics.html) – MichaelChirico Aug 11 '16 at 23:24
  • Was gonna ask you about the `.SD`, never seen that before, I like your answer better, but I just posted what I worked out on my own that worked. – TomNash Aug 11 '16 at 23:35
2

Thanks for the many and so diverse solutions. I am impressed!

I did some benchmarking on my large dataset to compare running time of several of the different approaches. Here is what I found:

data[ , concat := apply(.SD, 1, function(x) paste(+(x == ""), collapse = "|"))]

Time: 6 min, 41 s

data[, concat := do.call(paste, c(lapply(.SD, function(x) (x!="")+0 ), sep="|")) ]

Time: 10 min, 26 s

data[,concat := paste0(as.integer(.SD != ""), collapse = "|"), by = 1:nrow(data)]

Time: > 40 min (manually killed)

BigFinger
  • 1,033
  • 6
  • 7
1

Another option is to group the data by row and paste each row together:

data[,concat := paste0(as.integer(.SD != ""), collapse = "|"), by = 1:nrow(data)]
data
#   var1 var2 var3 var4  concat
#1:    a         a      1|0|1|0
#2:              b    b 0|0|1|1
#3:         c           0|1|0|0
Psidom
  • 209,562
  • 33
  • 339
  • 356