Dummy variables from a string variable

Question

I would like to create dummy variables form this dataset:

DF<-structure(list(A = c(1, 2, 3, 4, 5), B = c("1,3,2", "2,1,3,6", 
  "3,2,5,1,7", "3,7,4,2,6,5", "4,10,7,3,5,6")), .Names = c("A", "B"), 
              row.names = c(NA, 5L), class = "data.frame")
> DF
  A                  B
1 1              1,3,2
2 2            2,1,3,6
3 3          3,2,5,1,7
4 4        3,7,4,2,6,5
5 5       4,10,7,3,5,6

Desired output shoud look like this:

A  1  2  3  4  5  6  7  8  9  10
1  1  1  1  0  0  0  0  0  0  0
2  1  1  1  0  0  1  0  0  0  0
3  1  1  1  0  1  0  1  0  0  0
4  0  1  1  1  1  1  1  0  0  0
5  0  0  1  1  1  1  1  0  0  1

Is there a efficient way to do such thing? I can use strsplit or ifelse. Original dataset is very large with many rows (>10k) and values in column B (>15k). Function dummy from package dummies don't work as I want to.

I also found simmilar case: Splitting one column into multiple columns. But the anwsers from the link above work really slow in my case (up to 15 minutes on my Dell i7-2630QM, 8Gb, Win7 64 bit, R 2.15.3 64bit).

Thank you in advance for your anwsers.

Please see @Ananda Mahto 's solution. I think it takes the cake! Between 2x to 5x faster than the other solutions. — Ricardo Saporta, Apr 29 '13 at 05:46

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2013-10-06T17:19:47.763

UPDATE

The function mentioned here has now been moved to a package available on CRAN called "splitstackshape". The version on CRAN is considerably faster than this original version. The speeds should be similar to what you would get with the direct for loop solution at the end of this answer. See @Ricardo's answer for detailed benchmarks.

Install it, and use concat.split.expanded to get the desired result:

library(splitstackshape)
concat.split.expanded(DF, "B", fill = 0, drop = TRUE)
#   A B_01 B_02 B_03 B_04 B_05 B_06 B_07 B_08 B_09 B_10
# 1 1    1    1    1    0    0    0    0    0    0    0
# 2 2    1    1    1    0    0    1    0    0    0    0
# 3 3    1    1    1    0    1    0    1    0    0    0
# 4 4    0    1    1    1    1    1    1    0    0    0
# 5 5    0    0    1    1    1    1    1    0    0    1

Original post

A while ago, I had written a function to do not just this sort of splitting, but others. The function, named concat.split(), can be found here.

The usage, for your example data, would be:

## Keeping the original column
concat.split(DF, "B", structure="expanded")
#   A            B B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1        1,3,2   1   1   1  NA  NA  NA  NA  NA  NA   NA
# 2 2      2,1,3,6   1   1   1  NA  NA   1  NA  NA  NA   NA
# 3 3    3,2,5,1,7   1   1   1  NA   1  NA   1  NA  NA   NA
# 4 4  3,7,4,2,6,5  NA   1   1   1   1   1   1  NA  NA   NA
# 5 5 4,10,7,3,5,6  NA  NA   1   1   1   1   1  NA  NA    1

## Dropping the original column
concat.split(DF, "B", structure="expanded", drop.col=TRUE)
#   A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1   1   1   1  NA  NA  NA  NA  NA  NA   NA
# 2 2   1   1   1  NA  NA   1  NA  NA  NA   NA
# 3 3   1   1   1  NA   1  NA   1  NA  NA   NA
# 4 4  NA   1   1   1   1   1   1  NA  NA   NA
# 5 5  NA  NA   1   1   1   1   1  NA  NA    1

Recoding NA to 0 has to be done manually--perhaps I'll update the function to add an option to do so, and at the same time, implement one of these faster solutions :)

temp <- concat.split(DF, "B", structure="expanded", drop.col=TRUE)
temp[is.na(temp)] <- 0
temp
#   A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1   1   1   1   0   0   0   0   0   0    0
# 2 2   1   1   1   0   0   1   0   0   0    0
# 3 3   1   1   1   0   1   0   1   0   0    0
# 4 4   0   1   1   1   1   1   1   0   0    0
# 5 5   0   0   1   1   1   1   1   0   0    1

Update

Most of the overhead in the concat.split function probably comes in things like converting from a matrix to a data.frame, renaming the columns, and so on. The actual code used to do the splitting is a GASP for loop, but test it out, and you'll find that it performs pretty well:

b = strsplit(DF$B, ",")
ncol = max(as.numeric(unlist(b)))
temp = lapply(b, as.numeric)
## Set up an empty matrix
m = matrix(0, nrow = nrow(DF), ncol = ncol)      
## Fill it in
for (i in 1:nrow(DF)) {
  m[i, temp[[i]]] = 1
}
## View your result
m

very nice. I remember this from an older question http://stackoverflow.com/questions/13545547/ — Ricardo Saporta, Apr 29 '13 at 05:25
@RicardoSaporta, preallocation definitely helps, I suppose :) — A5C1D2H2I1M1N2O1R2T1, Apr 29 '13 at 05:49
certainly, but data.table preallocates as well and this is still much faster. See the updated benchmarks for larger dataset — Ricardo Saporta, Apr 29 '13 at 05:58
@RicardoSaporta, then the other possible explanation I can think of is that R generally performs faster with matrices than it does with `data.frame`s, and `data.table` is more like an optimized `data.frame`, and not an optimized `matrix`. (But I honestly don't know much at all about the inner workings of these things, so now I'm literally just pulling ideas out of thin air). — A5C1D2H2I1M1N2O1R2T1, Apr 29 '13 at 06:39
I've got a 20% speed gain (in median) by adding `fixed=TRUE` to your `strsplit` call (tested with the 10,000 x 600 DF in @RicardoSaporta answer). — Ferdinand.kraft, Apr 30 '13 at 03:22

score 10 · Answer 2 · edited Jun 09 '13 at 14:07

Update:

Added benchmarks below
Update2: added bechmarks for @Anada's solution. WOW it's fast!! Added benchmarks for an evern larger data set and @Anada's solution speeds ahead by a larger margin. '

Original Answer: As you can see below, KnownMax and UnknownMax are outperforming even the data.table solution. Although, I suspect that if there were 10e6+ rows, then the data.table solution would be fastest. (feel free to benchmark it by simply modifying the parameters at the very bottom of this post)

Solution 1: `KnownMax`

If you know the maximum value in B, then you have a nice, two-liner:

maximum <- 10
results <- t(sapply(strsplit(DF$B, ","), `%in%`, x=1:maximum)) + 0

#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,]    1    1    1    0    0    0    0    0    0     0
# [2,]    1    1    1    0    0    1    0    0    0     0
# [3,]    1    1    1    0    1    0    1    0    0     0
# [4,]    0    1    1    1    1    1    1    0    0     0
# [5,]    0    0    1    1    1    1    1    0    0     1

Three lines, if you want to name the columns and rows:

dimnames(results) <- list(seq(nrow(results)), seq(ncol(results)))

Solution 2: `UnknownMax`

# if you do not know the maximum ahead of time: 
splat <- strsplit(DF$B, ",")
maximum <- max(as.numeric(unlist(splat)))
t(sapply(splat, `%in%`, x=1:maximum)) + 0

Solution 3: `DT`

As per @dickoa's request, here is an option with data.table. '

DT <- data.table(DF)

DT.long <- DT[,  list(vals=as.numeric(unlist(strsplit(B, ",")))), by=A]

cols <- DT.long[, max(vals)]
rows <- DT.long[, max(A)] 

matrix(as.numeric(DT.long[, (1:cols) %in% vals, by=A]$V1), ncol=cols,
       byrow=TRUE, dimnames=list(seq(rows), seq(cols)))

#   1 2 3 4 5 6 7 8 9 10
# 1 1 1 1 0 0 0 0 0 0  0
# 2 1 1 1 0 0 1 0 0 0  0
# 3 1 1 1 0 1 0 1 0 0  0
# 4 0 1 1 1 1 1 1 0 0  0
# 5 0 0 1 1 1 1 1 0 0  1

Similar setup can be done in base R as well

===

Here are some benchmarks with slightly larger data:

microbenchmark(KnownMax = eval(KnownMax), UnknownMax = eval(UnknownMax),
    DT.withAssign = eval(DT.withAssign),
    DT.withOutAssign = eval(DT.withOutAssign),
    lapply.Dickoa = eval(lapply.Dickoa), apply.SimonO101 = eval(apply.SimonO101),
    forLoop.Ananda = eval(forLoop.Ananda), times=50L)

Using the OP data.frame, where the result is 5 x 10

  Unit: microseconds
             expr      min       lq    median       uq       max neval
         KnownMax  106.556  114.692  122.4915  129.406  6427.521    50
       UnknownMax  114.470  122.561  128.9780  136.384   158.346    50
    DT.withAssign 3000.777 3099.729 3198.8175 3291.284 10415.315    50
 DT.withOutAssign 2637.023 2739.930 2814.0585 2903.904  9376.747    50
    lapply.Dickoa 7031.791 7315.781 7438.6835 7634.647 14314.687    50
  apply.SimonO101  430.350  465.074  487.9505  522.938  7568.442    50
   forLoop.Ananda   81.415   91.027   99.7530  104.588   265.394    50

Using the slightly larger data.frame (below) where the results is 1000 x 100 removing lapply.Dickoa as my edit might have slowed it down and as it stood it crashed.

   Unit: milliseconds
             expr      min       lq   median        uq       max neval
         KnownMax 34.83210 35.59068 36.13330  38.15960  52.27746    50
       UnknownMax 36.41766 37.17553 38.03075  47.71438  55.57009    50
    DT.withAssign 31.95005 32.65798 33.73578  43.71493  50.05831    50
 DT.withOutAssign 31.36063 32.08138 32.80728  35.32660  51.00037    50
  apply.SimonO101 78.61677 91.72505 95.53592 103.36052 163.14346    50
   forLoop.Ananda 13.61827 14.02197 14.18899  14.58777  26.42266    50

Even larger set where the results is 10,000 x 600

Unit: milliseconds
             expr       min        lq    median        uq       max neval
         KnownMax 1583.5902 1631.6214 1658.6168 1724.9557 1902.3923    50
       UnknownMax 1597.1215 1655.9634 1690.7550 1735.5913 1804.2156    50
    DT.withAssign  586.4675  641.7206  660.7330  716.0100 1193.4806    50
 DT.withOutAssign  587.0492  628.3731  666.3148  717.5575  776.2671    50
  apply.SimonO101 1916.6589 1995.2851 2044.9553 2079.6754 2385.1028    50
   forLoop.Ananda  163.4549  172.5627  182.6207  211.9153  315.0706    50

Using the following:

library(microbmenchmark)
library(data.table)

KnownMax <- quote(t(sapply(strsplit(DF$B, ","), `%in%`, x=1:maximum)) + 0)
UnknownMax <- quote({    splat <- strsplit(DF$B, ","); maximum <- max(as.numeric(unlist(splat))); t(sapply(splat, `%in%`, x=1:maximum)) + 0})
DT.withAssign <- quote({DT <- data.table(DF); DT.long <- DT[,  list(vals=as.numeric(unlist(strsplit(B, ",")))), by=A]; cols <- DT.long[, max(vals)]; rows <- DT.long[, max(A)] ; matrix(as.numeric(DT.long[, (1:cols) %in% vals, by=A]$V1), ncol=cols, byrow=TRUE, dimnames=list(seq(rows), seq(cols)))})
DT.withOutAssign <- quote({DT.long <- DT[,  list(vals=as.numeric(unlist(strsplit(B, ",")))), by=A]; cols <- DT.long[, max(vals)]; rows <- DT.long[, max(A)] ; matrix(as.numeric(DT.long[, (1:cols) %in% vals, by=A]$V1), ncol=cols, byrow=TRUE, dimnames=list(seq(rows), seq(cols)))})
lapply.Dickoa <- quote({ tmp <- strsplit(DF$B, ","); label <- 1:max(as.numeric(unlist(tmp))); tmp <- lapply(tmp, function(x) as.data.frame(lapply(label, function(y) (x == y)))); unname(t(sapply(tmp, colSums))) })
apply.SimonO101 <- quote({cols <- 1:max( as.numeric( unlist(strsplit(DF$B,","))));  t(apply(DF["B"] , 1 , function(x) ifelse( cols %in% as.numeric( unlist( strsplit(x , ",") ) ) , 1 , 0 ) ) ) })
forLoop.Ananda <- quote({b = strsplit(DF$B, ","); ncol = max(as.numeric(unlist(b))); temp = lapply(b, as.numeric); m = matrix(0, nrow = nrow(DF), ncol = ncol)      ; for (i in 1:nrow(DF)) {  m[i, temp[[i]]] = 1 }; m })

# slightly modified @Dickoa's alogrithm to allow for instances were B is only a single number.  
#  Instead of using `sapply(.)`, I used `as.data.frame(lapply(.))` which hopefully the simplification process in sapply is analogous in time to `as.data.frame`

identical(eval(lapply.Dickoa), eval(UnknownMax))
identical(eval(lapply.Dickoa), unname(eval(apply.SimonO101)))
identical(eval(lapply.Dickoa), eval(KnownMax))
identical(unname(as.matrix(eval(DT.withAssign))), eval(KnownMax))
# ALL TRUE

this is what was used to create the sample data:

# larger data created as follows
set.seed(1)
maximum <- 600
rows <- 10000
DF <- data.frame(A=seq(rows), B=as.character(c(maximum, replicate(rows-1, paste(sample(maximum, sample(20), FALSE), collapse=",")))), stringsAsFactors=FALSE)
DT <- data.table(DF); 
DT

Thanks @AnandaMahto. Even with medium sized data, the base `R` solution is almost identical in speed to the `data.table` solution. — Ricardo Saporta, Apr 29 '13 at 04:34
+1 loving the benchmarks (apart from being trounced) but +1 goes for use of quote and eval for benchmarking - this is convenient for me to not rewrite some script commands into functions when I ma messing around! — Simon O'Hanlon, Apr 29 '13 at 08:22
@SimonO101, Ricardo's even asked [a question](http://stackoverflow.com/q/13713116/1270695) about this once.... — A5C1D2H2I1M1N2O1R2T1, Apr 29 '13 at 09:40

Simon O'Hanlon · Answer 3 · 2013-04-28T20:51:58.000

5

One way you could do this with ifelse and strsplit (unless I misunderstood and you don't want to use them?) is like this....

cols <- 1:max( as.numeric( unlist(strsplit(DF$B,","))))
df <- t(apply(DF["B"] , 1 , function(x) ifelse( cols %in% as.numeric( unlist( strsplit(x , ",") ) ) , 1 , 0 ) ) )

colnames(df) <- cols
df
#  1 2 3 4 5 6 7 8 9 10
#1 1 1 1 0 0 0 0 0 0  0
#2 1 1 1 0 0 1 0 0 0  0
#3 1 1 1 0 1 0 1 0 0  0
#4 0 1 1 1 1 1 1 0 0  0
#5 0 0 1 1 1 1 1 0 0  1

The idea is that we get a vector of the unique values in your desired column, find the max value and create a vector 1:max(value) then apply across each row to find out which values for that row are in the vector of all values. We use ifelse to put a 1 if it's there and 0 if it's not. The vector we match in is a sequence so its output is ready sorted.

edited Apr 28 '13 at 20:51

answered Apr 28 '13 at 20:46

Simon O'Hanlon

58,647
14
142
184

I am looking for most efficient way, so if means that I shoud use ifelse or strsplit to do so, it would use this functions. ;) – Maciej Apr 28 '13 at 20:54
Your version took 29.31068 secs. :) – Maciej Apr 28 '13 at 20:58
Is that good enough?! Or do you have the need for more speed? I think you will *probably* need to use `strsplit` or similar, because your data is character vectors. This is probably where the most time is taken up. The `%in%` operation is very fast and `ifelse` is not too bad, and quite convenient code. – Simon O'Hanlon Apr 28 '13 at 21:13
Yes, it is. Thank you! Of course if there will be a faster way i wouldn't mind. – Maciej Apr 28 '13 at 21:23
@Maciej, `ifelse` by its very nature is slow (it is porcessing *all three* vectors). If you are looking for efficiency, then you should try to avoid it. One way of doing so is to intialize your vector (or better yet, entire matrix) to `0` and then use a standard `if` statement to set it to `1` were appropriate. – Ricardo Saporta Apr 29 '13 at 03:15
@RicardoSaporta I think it's a bit much to say `ifelse` is *very* slow. It is slower because it is convenience notation, and does process two `if` statements (well 4 actually, but not in the function test). However, only one expression will be evaluated because not both conditions can be TRUE at the same time, and the evaluation of the expression is where the time is spent. – Simon O'Hanlon Apr 29 '13 at 08:08
@SimonO101, if the `test` condition in the `ifelse` statement evaluates to only `TRUE` or only `FALSE`, then only one of the `yes`/`no` vectors are computed. If on the other hand both `TRUE` and `FALSE` are present in the `ifelse` `test` condition, then both vectors are computed. – Ricardo Saporta Apr 29 '13 at 09:11
@RicardoSaporta ok, yes I see what you mean. But `ifelse` is still vectorised and I do not think it is so slow, in the context of one using `if` statements. I would always just use logical comparison if possible, as I said `ifelse` is just convenience. – Simon O'Hanlon Apr 29 '13 at 09:17
@Simon O'Hanlon Nice 2-liner! I get about a 50% speed-up by using `as.numeric` to coerce the logicals to 0/1 rather than assigning them with `ifelse`. Nevertheless, the for loop still seems to take the cake... – treysp Aug 19 '15 at 23:46

Martin Morgan · Answer 4 · 2013-06-09T16:17:57.960

A little late to the game, but a different strategy uses the fact that a matrix can be indexed by another two-column matrix specifying row and and column indexes for update. So

f2 <- function(DF) {
    b <- strsplit(DF$B, ",", fixed=TRUE)
    len <- vapply(b, length, integer(1)) # 'geometry'
    b <- as.integer(unlist(b))

    midx <- matrix(c(rep(seq_len(nrow(DF)), len), b), ncol=2)
    m <- matrix(0L, nrow(DF), max(b))
    m[midx] <- 1L
    m
}

This uses strsplit(..., fixed=TRUE) and vapply for efficiency and type safety, and as.integer and 0L, 1L because we really want integer and not numeric return values.

For comparison, here's the original implementation from @AnandaMahto

f0 <- function(DF) {
    b = strsplit(DF$B, ",")
    ncol = max(as.numeric(unlist(b)))
    temp = lapply(b, as.numeric)
    m = matrix(0, nrow = nrow(DF), ncol = ncol)
    for (i in 1:nrow(DF)) {
        m[i, temp[[i]]] = 1
    }
    m
}

This can be improved for efficiency by using fixed=TRUE and avoiding the double coercion of b, and made more robust by coercing to integer and using seq_len(nrow(DF)) to avoid the corner case of 0-row DF

f1 <- function(DF) {
    b = lapply(strsplit(DF$B, ",", fixed=TRUE), as.integer)
    ncol = max(unlist(b))
    m = matrix(0L, nrow = nrow(DF), ncol = ncol)      
    for (i in seq_len(nrow(DF)))
        m[i, b[[i]]] = 1L
    m
}

The for loop is a good candidate for compilation, so

library(compiler)
f1c <- cmpfun(f1)

and then for comparison on the 10,000 x 600 data from @RicardoSaporta

> library(microbenchmark)
> microbenchmark(f0(DF), f1(DF), f1c(DF), f2(DF))
Unit: milliseconds
    expr       min        lq    median        uq      max neval
  f0(DF) 170.51388 180.25997 182.45772 188.23811 717.7511   100
  f1(DF)  91.53578  97.14909  97.97195 100.24236 447.5900   100
 f1c(DF)  79.39194  84.45712  85.71022  87.85763 411.8340   100
  f2(DF)  76.45496  81.70307  82.50752 110.83620 398.6093   100

Both the 2-fold increase from f0 to f1 and relative efficiency of the for loop were relatively surprising to me. @AnandaMahto's solution is more memory efficient, made more so without too much performance cost with

ncol = max(vapply(b, max, integer(1)))

I missed this "late to the game" answer, but really great step-by-step walk-through. Thanks for sharing! +1 — A5C1D2H2I1M1N2O1R2T1, Sep 09 '13 at 17:18

score 3 · Answer 5 · edited Apr 29 '13 at 04:32

I know there's already a good and quite efficient answer but we can use another approach too to get the same results.

tmp <- strsplit(DF$B, ",")
label <- 1:max(as.numeric(unlist(tmp)))
tmp <- lapply(tmp, function(x)
              sapply(label, function(y) (x == y)))

t(sapply(tmp, colSums))

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    1    1    0    0    0    0    0    0     0
## [2,]    1    1    1    0    0    1    0    0    0     0
## [3,]    1    1    1    0    1    0    1    0    0     0
## [4,]    0    1    1    1    1    1    1    0    0     0
## [5,]    0    0    1    1    1    1    1    0    0     1

We can benchmark it now to compare with @SimonO101 solution (fun2)

require(rbenchmark)

fun1 <- function(DF) {
    tmp <- strsplit(DF$B, ",")
    label <- 1:max(as.numeric(unlist(tmp)))
    tmp <- lapply(tmp, function(x)
                  sapply(label, function(y) (x == y)))
    t(sapply(tmp, colSums))

}


fun2 <- function(DF) {
    cols <- 1:max( as.numeric( unlist(strsplit(DF$B,","))))
    df <- t(apply(DF["B"] , 1 , function(x) ifelse( cols %in% as.numeric( unlist( strsplit(x , ",") ) ) , 1 , 0 ) ) )

    colnames(df) <- cols
    df
}


all.equal(fun1(DF),
          fun2(DF),
          check.attributes = FALSE)

## [1] TRUE


benchmark(fun1(DF),
          fun2(DF),
          order = "elapsed",
          columns = c("test", "elapsed", "relative"),
          replications = 5000)


##       test elapsed relative
## 1 fun1(DF)   1.870    1.000
## 2 fun2(DF)   2.018    1.079

As we can see there's not a big difference.

Suggested edit (RS):

# from: 
tmp <- lapply(tmp, function(x)
           sapply(label, function(y) (x == y)))

#  to: 
tmp <- lapply(tmp, function(x)
          as.data.frame(lapply(label, function(y) (x == y))))

+1 for the marginal gains. Nice use of colSums. I think my solution gets a tiny bit quicker if you remove the assignment `<-` to df, the `colnames` call and the `df` at the end which prints the data.frame which are not in your function. You are still quicker though by about 30 microseconds ( `0.000579` vs `0.000615` seconds ) :-) — Simon O'Hanlon, Apr 28 '13 at 22:21
Thank you ! you are right about `colnames` and `df` at the end. It will be great if we can have a fast `data.table` solution, I tried without succcess :( — dickoa, Apr 28 '13 at 22:32
@dickoa, this is great. However, if B has a value that is only a single number, I believe this breaks. — Ricardo Saporta, Apr 29 '13 at 03:55
@RicardoSaporta, I was wondering why I couldn't get it to work when trying some benchmarks on a larger dataset. Nice catch. — A5C1D2H2I1M1N2O1R2T1, Apr 29 '13 at 04:12
@dickoa, hope you don't mind, I added a suggestion at the bottom of your post. — Ricardo Saporta, Apr 29 '13 at 04:33
@RicardoSaporta Nice catch and great answer (I learn a lot)...thank you very much. — dickoa, Apr 29 '13 at 06:52

Simon O'Hanlon · Answer 6 · 2013-08-13T14:26:49.357

Ok, this has been bugging me for a while, but I thought it would be a good use of Rcpp. So I wrote a little function too see if I can get something faster than @Ananda's amazing for loop solution. This solution seems to run approximately twice as fast (using the larger sample dataset posted by @RicardoSaporta).

Note: I was attempting this more to teach myself how to use Rcpp and C++ than to provide a useful solution, but all the same...

Our .cpp file...

#include <Rcpp.h>
#include <string>
#include <sstream>

using namespace Rcpp;

//[[Rcpp::export]]

NumericMatrix expandR(CharacterVector x) {
    int n = x.size();
    std::vector< std::vector<int> > out;    // list to hold numeric vectors
    int tmax = 0;
    for(int i = 0; i < n; ++i) {
      std::vector<int> vect;                // vector to hold split strings
      std::string str = as<std::string>(x[i]);
      std::stringstream ss(str);
      int j = 0;
      while (ss >> j) {
      vect.push_back(j);  // add integer to result vector
        if (ss.peek() == ',') //split by ',' delim
          ss.ignore();
      }
     int it = *std::max_element(vect.begin(), vect.end());
      if( it > tmax )
        tmax = it;  //current max value
      out.push_back(vect);
    }
// Now we construct the matrix. tmax gives us number of columns, n is number of rows;
    NumericMatrix mat(n,tmax);
    for( int i = 0; i < n; ++i) {
      NumericMatrix::Row zzrow = mat( i , _ );
      std::vector<int> vec = out[i];
      for( int j = 0; j < vec.size(); ++j ) {
        zzrow[ (vec[j]-1) ] = 1; //don't forget R vs. C++ indexing
        }
    }
    return mat;
}

Using the nominal example from the OP we can then just do...

require(Rcpp)

##  source the function so it is available to use in R
sourceCpp("C:/path/to/file.cpp")

#  Call it like any other R function
expandR(DF$B)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    1    1    0    0    0    0    0    0     0
[2,]    1    1    1    0    0    1    0    0    0     0
[3,]    1    1    1    0    1    0    1    0    0     0
[4,]    0    1    1    1    1    1    1    0    0     0
[5,]    0    0    1    1    1    1    1    0    0     1

And using the larger dataset provided by @Ricardo) and comparing with @Ananda's solution)....

require(Rcpp)
require(data.table)
set.seed(1)
maximum <- 600
rows <- 10000
DF <- data.frame(A=seq(rows), B=as.character(c(maximum, replicate(rows-1, paste(sample(maximum, sample(20), FALSE), collapse=",")))), stringsAsFactors=FALSE)
DT <- data.table(DF); 
DT

##  source in our c code
sourceCpp("C:/Users/sohanlon/Desktop/expandR2.cpp")

forLoop.Ananda  <- quote({b = strsplit(DF$B, ","); ncol = max(as.numeric(unlist(b))); temp = lapply(b, as.numeric); m = matrix(0, nrow = nrow(DF), ncol = ncol)      ; for (i in 1:nrow(DF)) {  m[i, temp[[i]]] = 1 }; m })
rcpp.Simon      <- quote({mm = expandR( DT$B )})

require(microbenchmark)
microbenchmark( eval(forLoop.Ananda) , eval(rcpp.Simon) , times = 5L )
Unit: milliseconds
                 expr      min       lq   median       uq      max neval
 eval(forLoop.Ananda) 173.3024 178.6445 181.5881 218.9619 227.9490     5
     eval(rcpp.Simon) 115.8309 116.3876 116.8125 119.1971 125.6504     5

+1 Nice work! I know nothing about C++ and Rccp. It's on my list of "things I need to learn to answer questions on SO" :) — A5C1D2H2I1M1N2O1R2T1, Aug 13 '13 at 10:36
Yeah, the same for me. I have been making a concerted effort to try and learn. I *know* there is some inefficient stuff in here, but I'm still a newbie. This could be improved quite a bit I think. But I managed to get it working (after 600 compile errors) which was the main thing! :-) — Simon O'Hanlon, Aug 13 '13 at 10:37

score 0 · Answer 7 · answered Jul 19 '19 at 08:49

Not a particularly fast solution, however, it could be useful for those preferring tidyverse possibilities:

DF %>%
 mutate(B = str_split(B, fixed(","))) %>%
 unnest() %>%
 transmute(A,
           var = as.numeric(B),
           val = 1) %>%
 complete(var = seq(min(var), max(var), 1), nesting(A)) %>%
 spread(var, val, fill = 0)

      A   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     1     1     1     0     0     0     0     0     0     0
2     2     1     1     1     0     0     1     0     0     0     0
3     3     1     1     1     0     1     0     1     0     0     0
4     4     0     1     1     1     1     1     1     0     0     0
5     5     0     0     1     1     1     1     1     0     0     1

To have more compact column names:

DF %>%
 mutate(B = str_split(B, fixed(","))) %>%
 unnest() %>%
 transmute(A,
           var = as.numeric(B),
           val = 1) %>%
 complete(var = seq(min(var), max(var), 1), nesting(A)) %>%
 spread(var, val, fill = 0) %>%
 rename_at(2:length(.), ~ paste0("Col", 1:length(.)))

      A  Col1  Col2  Col3  Col4  Col5  Col6  Col7  Col8  Col9 Col10
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     1     1     1     0     0     0     0     0     0     0
2     2     1     1     1     0     0     1     0     0     0     0
3     3     1     1     1     0     1     0     1     0     0     0
4     4     0     1     1     1     1     1     1     0     0     0
5     5     0     0     1     1     1     1     1     0     0     1

Dummy variables from a string variable

7 Answers7

UPDATE

Update

Update:

Solution 1: `KnownMax`

Solution 2: `UnknownMax`

Solution 3: `DT`

Linked

Related

Dummy variables from a string variable

7 Answers7

UPDATE

Update

Update:

Solution 1: KnownMax

Solution 2: UnknownMax

Solution 3: DT

Linked

Related

Solution 1: `KnownMax`

Solution 2: `UnknownMax`

Solution 3: `DT`