2

I don't know if using dcast() is the right way, but I want to reshape the following data.frame:

df <- data.frame(x=c("p1","p1","p2"),y=c("a","b","a"),z=c(14,14,16))
df
   x y  z
1 p1 a 14
2 p1 b 14
3 p2 a 16

so that it looks like this one:

df2 <- data.frame(x=c("p1","p2"),a=c(1,1),b=c(1,0),z=c(14,16))
   x a b  z
1 p1 1 1 14
2 p2 1 0 16

The variable y in df should be broken so that its elements are new variables, each dummy coded. All other variables (in this case just z) are equal for each person (p1,p2 etc.). The only variable where a specific person p has different values is y.
The reason I want this is because I need to merge this dataset with other ones by variable x. Thing is, it needs to be one row per person (p1,p2 etc).

joran
  • 169,992
  • 32
  • 429
  • 468
beginneR
  • 3,207
  • 5
  • 30
  • 52

4 Answers4

2

The following works, but seems cumbersome.

df2 <- df
df2$y <- as.numeric(y)
df$y2 <- as.numeric(df$y)

df2 <- dcast(df, x+z~y, value.var="y2")

df2
   x  z a  b
1 p1 14 1  2
2 p2 16 1 NA
Ricardo Saporta
  • 54,400
  • 17
  • 144
  • 178
2

This is almost a duplicate of a previous question, and the same basic answer I used there works again. No need for any external packages either.

aggregate(model.matrix(~ y - 1, data=df),df[c("x","z")],max)

   x  z ya yb
1 p1 14  1  1
2 p2 16  1  0

To explain this, as it is a bit odd looking, the model.matrix call at its most basic returns a binary indicator variable for each unique value for each row of your data.frame, like so:

  ya yb
1  1  0
2  0  1
3  1  0

If you aggregate that intermediate result by your two id variables (x and z), you are then essentially acting on the initial data.frame of:

   x  z ya yb
1 p1 14  1  0
2 p1 14  0  1
3 p2 16  1  0

So if you take the max value of ya and yb within each combination of x and z, you basically do:

   x  z ya      yb
1 p1 14  1*max*  0
2 p1 14  0       1*max*

--collapse--

   x  z ya      yb
1 p1 14  1       1

...and repeat that for each unique x/z combination to give the final result:

   x  z ya yb
1 p1 14  1  1
2 p2 16  1  0

Things get a bit crazy to generalise this to more columns, but it can be done, courtesy of this question e.g.:

df <- data.frame(x=c("p1","p1","p2"),y=c("a","b","a"),z=c("14","15","16"))
intm <- model.matrix(~ y + z - 1, data=df,
                 contrasts.arg = sapply(df[2:3], contrasts, contrasts=FALSE))
aggregate(intm,df[c("x")],max)

   x ya yb z14 z15 z16
1 p1  1  1   1   1   0
2 p2  1  0   0   0   1
Community
  • 1
  • 1
thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • Thank you, that is a nice and short solution even though I don't get it at the moment. The "max" seems weird. – beginneR Aug 08 '13 at 07:01
  • @Macs - i've added a longhand explanation, hope it helps. – thelatemail Aug 08 '13 at 07:16
  • Does this function also work if I have not just one variable but let's say two which need to be reshaped? ya,yb,k1,k2,k3 (second variable named "k" with Labels 1,2,3). thanks – beginneR Aug 08 '13 at 09:38
  • imagine the second element of z in df would be 15, not 14. in this case, variable z also Needs to be reshaped in the way y was. (z14,z15,z16). is this possible? – beginneR Aug 08 '13 at 09:44
  • it nearly works with model.matrix(~ y+z -1)... but because there are three Levels, one is left out in the model.matrix. – beginneR Aug 08 '13 at 09:51
  • it is also possible building two separate model matrices for each variable and then column binding both. thanks anyway! – beginneR Aug 09 '13 at 07:08
1

I'm not sure much of this you have to do but if you need a way to automate it, I wrote this little function that might help:

First run dcast:

new = dcast(df, x+z~y, value.var="y")

Load into your R environment:

 # args to be passed: 
 # df is your dataframe 
 # cols is a list of format c("colname1", "colname2", ... , "colnameN")
    binarizeCols = function(df, cols){
      for(i in cols){
        column = which(colnames(df) == i)
        truthRow = is.na(df[,column])
        for(j in 1:length(truthRow)){
          if(truthRow[j] == FALSE){
            df[j,column] = 1
          }else{
             df[j,column] = 0
           }
        }
      }
      return(df)
    }

then run:

new = binarizeCols(new, c("a", "b"))

and you get:

     x  z  a  b
   1 p1 14 1  1 
   2 p2 16 1  0

not as fast as using _apply() but there's no hardcoding, you can enter any colnames you want (maybe you want to skip one in the middle?) and you dont create a new instance of your df. note: I use "=" instead of "<-" because I thought it was being phased out but they can be replaced if need be.

santeko
  • 360
  • 5
  • 10
  • You could shorten your `binarizeCols` function substantially and do away with the loops - `binarizeCols <- function(x,cols) { x[cols] <- (!is.na(x[cols]))+0; x; }` – thelatemail Aug 08 '13 at 00:51
0
df <- data.frame(x=c("p1","p1","p2","p3"),
                 y=c("a","b","a","c"),
                 z=c(14,14,16,17))  # wanted larger test case.
new <- dcast(df, x+z~y, value.var="y")
new[3:5] <- sapply(lapply(new[3:5], '%in%', unique(df$y) ), as.numeric)
new
   x  z a b c
1 p1 14 1 1 0
2 p2 16 1 0 0
3 p3 17 0 0 1

First check for containment in a vector that summarizes the possible values to create columns of logical values. Then 'dummify' by taking as.numeric of those logical values.

IRTFM
  • 258,963
  • 21
  • 364
  • 487