Convert strings to multiple binary columns in R

Question

I've got a data frame that looks like:

Date          Names
1/1/2000      A|B
2/3/2003      A|C|D
2/4/2004      B|C|E

I need to convert it to:

Date          A B C D E
1/1/2000      1 1 0 0 0
2/3/2003      1 0 1 1 0
2/4/2004      0 1 1 0 1

So each unique name in the strings should become the header of a new column indicating on which dates it was seen or not seen.

see this question: http://stackoverflow.com/questions/15905806/improve-text-processing-speed-using-r-and-data-table/16179023#16179023 — eddi, May 01 '13 at 21:06
@eddi What is n in your code? I got an error when running the sparseMatrix function. — user2133354, May 01 '13 at 21:45
you'll have to be more specific - that code works for me (given you take the `dt` from the original question there) — eddi, May 01 '13 at 21:51
@eddi What should n be in the row starting with "rows = ..." — user2133354, May 02 '13 at 14:23
it's a column in the `data.table` called `tmp` that's constructed in the previous expression. You should really create the `dt` from OP and then run the code line by line. — eddi, May 02 '13 at 14:58
Thanks for your help. i ended up using the DT.GT_Mod solution there. — user2133354, May 02 '13 at 18:15

Remi.b · Answer 1 · 2013-05-01T20:08:38.930


Here is a brute force solution:

library(plyr)

# For one row: set to 1 every column whose name appears in that row's Names string
fun.2 = function (x) {
    # split on the literal "|" so names longer than one character also work
    present = strsplit(as.character(x[[2]]), '|', fixed = TRUE)[[1]]
    x[names(x) %in% present] = 1
    return(x)
}

# Add zero-filled columns A to E (hard-coded for this example), then fill them row by row
myfunction = function (df) {
    df1 = cbind(df, A = 0, B = 0, C = 0, D = 0, E = 0)
    df2 = adply(df1, 1, fun.2)
    return(df2)
}

# then run

myfunction(df)

      Date Names A B C D E
1 1/1/2000   A|B 1 1 0 0 0
2 2/3/2003 A|C|D 1 0 1 1 0
3 2/4/2004 B|C|E 0 1 1 0 1
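
For larger inputs, a vectorized variant in base R may be worth trying. The following is only a sketch, not part of the original answer: it rebuilds the question's example data as `df`, and the names `parts`, `lev`, `rows` and `ind` are illustrative. It assumes the names are separated by a literal `|` and derives the column set from the data instead of hard-coding A to E. `lengths()` requires R 3.2.0 or later.

```r
# Example data from the question, rebuilt so the sketch is self-contained
df <- data.frame(
  Date  = c("1/1/2000", "2/3/2003", "2/4/2004"),
  Names = c("A|B", "A|C|D", "B|C|E"),
  stringsAsFactors = FALSE
)

# Split every string on the literal "|" (fixed = TRUE disables regex alternation)
parts <- strsplit(as.character(df$Names), "|", fixed = TRUE)

# Sorted set of all names found in the data; these become the new column headers
lev <- sort(unique(unlist(parts)))

# Zero matrix with one row per input row and one column per name
ind <- matrix(0L, nrow = length(parts), ncol = length(lev),
              dimnames = list(NULL, lev))

# Row index for each split element, then set the matching (row, column) cells to 1
rows <- rep(seq_along(parts), lengths(parts))
ind[cbind(rows, match(unlist(parts), lev))] <- 1L

result <- cbind(df["Date"], ind)
print(result)
```

Because the matrix is filled with a single indexed assignment, no per-row function call is involved, which is usually where the row-by-row versions spend their time. Whether this is fast enough for a given data set is something to measure, not assume.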

edited May 01 '13 at 20:08

answered May 01 '13 at 18:42


Thanks for the suggestion. It seems like the second for loop is very slow (and I have 64000 lines). Any suggestions how to make it faster? – user2133354 May 01 '13 at 19:21
Here is exactly the same solution as before, but implemented with the adply function – Remi.b May 01 '13 at 20:09
