I've got a data frame that looks like:

Date          Names
1/1/2000      A|B
2/3/2003      A|C|D
2/4/2004      B|C|E

I need to convert it to:

Date          A B C D E
1/1/2000      1 1 0 0 0
2/3/2003      1 0 1 1 0
2/4/2004      0 1 1 0 1

So each unique name in the strings should become the header of a new column indicating on which dates it was seen (1) or not seen (0).

  • see this question: http://stackoverflow.com/questions/15905806/improve-text-processing-speed-using-r-and-data-table/16179023#16179023 – eddi May 01 '13 at 21:06
  • @eddi What is n in your code? I got an error when running the sparseMatrix function. – user2133354 May 01 '13 at 21:45
  • you'll have to be more specific - that code works for me (given you take the `dt` from the original question there) – eddi May 01 '13 at 21:51
  • @eddi What should n be in the row starting with "rows = ..." – user2133354 May 02 '13 at 14:23
  • it's a column in the `data.table` called `tmp` that's constructed in the previous expression. You should really create the `dt` from OP and then run the code line by line. – eddi May 02 '13 at 14:58
  • Thanks for your help. I ended up using the DT.GT_Mod solution there. – user2133354 May 02 '13 at 18:15

1 Answer

Here is a brute force solution:

library(plyr)

# Set to 1 every column whose header appears in this row's Names string
fun.2 = function (x) {
    seen = strsplit(as.character(x[[2]]), '|', fixed = TRUE)[[1]]
    x[names(x) %in% seen] = 1
    return(x)
}
myfunction = function (df) {
    # One zero-filled column per unique name (instead of hard-coding A to E)
    all.names = sort(unique(unlist(strsplit(as.character(df[[2]]), '|', fixed = TRUE))))
    df1 = df
    df1[all.names] = 0
    df2 = adply(df1, 1, fun.2)
    return(df2)
}

# you can run

myfunction ( df )

      Date Names A B C D E
1 1/1/2000   A|B 1 1 0 0 0
2 2/3/2003 A|C|D 1 0 1 1 0
3 2/4/2004 B|C|E 0 1 1 0 1
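
For larger inputs (such as the 64000 rows mentioned in the comments), the row-by-row `adply` call is likely to be the bottleneck. Below is a minimal base R sketch that fills the whole indicator matrix in one indexed assignment instead. It assumes the columns are named `Date` and `Names` as in the question; the variable names (`parts`, `all_names`, `indicator`, `result`) are illustrative only, and it has not been benchmarked against the solutions in the linked question.

```r
# Example data from the question (column names as shown there).
df <- data.frame(
  Date  = c("1/1/2000", "2/3/2003", "2/4/2004"),
  Names = c("A|B", "A|C|D", "B|C|E"),
  stringsAsFactors = FALSE
)

# Split each string on a literal "|" (fixed = TRUE, so "|" is not
# treated as regex alternation).
parts <- strsplit(as.character(df$Names), "|", fixed = TRUE)

# Sorted set of every name that occurs; these become the new columns.
all_names <- sort(unique(unlist(parts)))

# Start from an all-zero matrix: one row per input row, one column per name.
indicator <- matrix(0L, nrow = length(parts), ncol = length(all_names),
                    dimnames = list(NULL, all_names))

# Row index for each individual name, repeated by how many names each row has.
rows <- rep(seq_along(parts), lengths(parts))

# Column index for each individual name, then set all matching cells at once.
cols <- match(unlist(parts), all_names)
indicator[cbind(rows, cols)] <- 1L

result <- cbind(df["Date"], indicator)
print(result)
```

Because the assignment uses a two-column index matrix, no explicit loop over rows is needed, and the set of columns is derived from the data rather than hard-coded.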
Remi.b
  • Thanks for the suggestion. It seems like the second for loop is very slow (and I have 64000 lines). Any suggestions how to make it faster? – user2133354 May 01 '13 at 19:21
  • Here is exactly the same solution as before, but implemented with the adply function – Remi.b May 01 '13 at 20:09