0

I want to create a matrix with 3 columns and many rows assigning 1 or 0 if the condition is satisfied.

I have data stored in 3 variables

 df1 <- data.frame(names=c("A","B","C","D","E","F"))
 df2 <- data.frame(names=c("A","B","C","F"))
 df3 <- data.frame(names=c("E","F","H"))

output will be

     df1  df2 df3
  A  1    1   0
  B  1    1   0 
  C  1    1   0
  D  1    0   0
  E  1    1   1
  F  1    0   1
  H  0    0   1

In first row if A is present in dataset then I will assign 1 under each column and 0 if A not present in dataset

Here is what I have tried

 DF <- rbind(df1,df2,df3)
 for (i in DF) { 
     for (j in 1:length(df1$names)) {
              if(i == df1$names[j]){
                    A3 <-data.frame(paste0("",i),paste0(1),paste0(0),paste0(0))
                    names(A3) <- NULL 
              }
              else{
                    A3 <-data.frame(paste0("",i),paste0(0),paste0(0),paste0(0))

              }
  }
}

I have written this code only for df1 but its very slow because I have more than 1500 rows in my orignal data set. What would be the fastest way to do it?

Cath
  • 23,906
  • 5
  • 52
  • 86
hash
  • 103
  • 10

4 Answers4

3

Add a grouping variable to each dataframe:

df1 <- data.frame(names=c("A","B","C","D","E","F"),group="df1")
df2 <- data.frame(names=c("A","B","C","F"),group="df2")
df3 <- data.frame(names=c("E","F","H"),group="df3")
DF <- rbind(df1,df2,df3)

Then do this:

   res <- table(DF)

> res
     group
names df1 df2 df3
    A   1   1   0
    B   1   1   0
    C   1   1   0
    D   1   0   0
    E   1   0   1
    F   1   1   1
    H   0   0   1

Or if you want a dataframe:

library(reshape2)
dcast(names~group, data=DF,fun.aggregate = length)
Heroka
  • 12,889
  • 1
  • 28
  • 38
  • This works for the exampledata you've given, I don't now if there are multiple instances of the same string in your data? – Heroka Sep 08 '15 at 10:17
  • Thank you so much Heroka. I don't have duplicates in my data – hash Sep 08 '15 at 10:43
1

When using the idcol parameter in rbindlist of the data.table package, there is no need of creating a grouping column for each dataframe separately:

library(data.table) # I used v1.9.5 for this
DT <- rbindlist(list(df1, df2, df3), idcol="id")
dcast(DT[, .N , by=.(id,names)], names ~ id, fill=0)

which gives:

   names 1 2 3
1:     A 1 1 0
2:     B 1 1 0
3:     C 1 1 0
4:     D 1 0 0
5:     E 1 0 1
6:     F 1 1 1
7:     H 0 0 1
Jaap
  • 81,064
  • 34
  • 182
  • 193
0

%in% operator lets you check if a string is present in a vector of strings. It is also vectorised, so it works quite quick:

x=c(LETTERS[c(1:6,8)])
df=data.frame(x=x,df1=as.numeric(x %in% df1$names),
          df2=as.numeric(x %in% df2$names),
          df3=as.numeric(x %in% df3$names))
df

If speed is crucial, {data.table} package gives a little speed boost with %chin% operator:

library(data.table)
x=c(LETTERS[c(1:6,8)])
dt=data.table(x=x,df1=as.numeric(x %chin% as.character(df1$names)),
          df2=as.numeric(x %chin% as.character(df2$names)),
          df3=as.numeric(x %chin% as.character(df3$names)))
dt
Maksim Gayduk
  • 1,051
  • 6
  • 13
0

The code below is slightly more general than the other answers. Also, I think it's useful to know how to dynamically create commands... I use the data frames as you prepared them:

df1 <- data.frame( names = c( "A", "B", "C", "D", "E", "F") )
df2 <- data.frame( names = c( "A", "B", "C"," F") )
df3 <- data.frame( names = c( "E", "F", "H") )

DF <- rbind( df1, df2, df3 )
nDF <- unique( DF ) #we don't want to duplicate tests.

Then the main loop is just like this:

n_ <- 3
for( ii in 1 : n_){
nDF[ paste0( "df", ii ) ] <- as.logical( NA ) #dynamically creates a new variable in your data frame

cmnd <-  paste0("nDF$names %in% df",ii,"$names") #dynamically creates the appropriate command (in this case you want to test e.g. whether "nDF$names %in% df1$names".

nDF[ paste0("df",ii)]  <- eval( parse( text = cmnd ) ) #evaluates the dynamically created command and saves it into the previously created variable.
}

Should be relatively fast. But if you don't have duplicates in your data, then heroka's suggestion to this questions is probably the way to go.

Community
  • 1
  • 1
DH4wes
  • 71
  • 5