3

I have asked the same question a few days ago ( click here), but didn't mention that a result using data.table would be appreciated

The "aggregate-solution" works fine, even though it is pretty slow! I am searching for a faster way to solve this.

I want to reshape the following data.frame:

df <- data.frame(x=c("p1","p1","p2"),y=c("a","b","a"),z=c(14,14,16))
df
   x y  z
1 p1 a 14
2 p1 b 14
3 p2 a 16

so that it looks like this one:

df2 <- data.frame(x=c("p1","p2"),a=c(1,1),b=c(1,0),z=c(14,16))
   x a b  z
1 p1 1 1 14
2 p2 1 0 16

The variable y in df should be broken so that its elements are new variables, each dummy coded. All other variables (in this case just z) are equal for each person (p1,p2 etc.). The only variable where a specific person p has different values is y.
The reason I want this is because I need to merge this dataset with other ones by variable x. Thing is, it needs to be one row per person (p1,p2 etc).

Community
  • 1
  • 1
beginneR
  • 3,207
  • 5
  • 30
  • 52
  • I should have done that but I just didn't consider the "self-contained-aspect" when I posted this question. Because eddi already posted a way I think I'll leave it as it is. However, thanks for your comment – beginneR Aug 21 '13 at 20:02

1 Answers1

5

Going to wide format is a little awkward currently in data.table, but I think this works:

library(data.table)
dt = data.table(x=c("p1","p1","p2"),y=c("a","b","a"),z=c(14,14,16))

setkey(dt, x, y)
dt[CJ(unique(x), unique(y)), list(.N, z)][,
   setNames(as.list(c(N, z[!is.na(z)][1])), c(y, 'z')), by = x]
#    x a b  z
#1: p1 1 1 14
#2: p2 1 0 16

The CJ part joins by all combinations of unique x and y, and then in that join there is a hidden by-without-by that's used to compute counts via .N. Once you have those it's just a matter of putting them horizontally for each x together with any non-NA z (I chose the first) and that's accomplished using as.list. Finally setNames sets the column names correctly.

eddi
  • 49,088
  • 6
  • 104
  • 155
  • 1
    @beginneR explanation added – eddi Aug 22 '13 at 12:45
  • thanks, this is fantastic! but now it gets even more complicated. what if there is another variable like y, e.g z. where values can also differ. is there a possibility to adjust the code for this scenario. thelatemail edited the first answer to the question I linked here using Aggregate for more than two columns. Would be perfect if this could also be done for your data.table solution. see also my comments to "his" answer in the linked question. (PS: sorry for linking to another question) – beginneR Aug 23 '13 at 14:38
  • 1
    @beginneR from what I can see the two variable case in the other question is just two separate runs of the above algo - just set the key to `x` and `y` first, then `x` and `z`, do the above for each case, and then `cbind` the results – eddi Aug 23 '13 at 14:51
  • hmm I get the following error message: "j doesn't evaluate to the same number of columns for each group". But this error doesn't concern cbind(). It's when I try the code of your answer just for a single variable. Didn't figure out why this doesn't work for my data yet. Do you have an idea? – beginneR Aug 23 '13 at 15:12
  • @beginneR that error means that the `j-expressions` produce results of different sizes for different values of your `by` argument - try pinpointing two conflicting `by` values and then you may be able to figure out how to fix it; one way to pinpoint would be to add a `print` statement in your `j-expression` and you'll be able to see where it breaks – eddi Aug 23 '13 at 15:16
  • i think I figured it out. I have rows where all values are equal, so after your first Operation, I have not just Zero and 1 values, but "2" and "3" as well. Think thats the problem – beginneR Aug 23 '13 at 15:19