Condensing multiple observations on the same individual into a single row, adding multiples as new columns

Question

I have data that looks like this:

ID  X1    X2    X3

1   1.4   2     two
2   7.6   30    thirty
2   7.6   50    fifty
2   7.6   40    forty
3   5.6   40    forty
3   5.6   50    fifty
4   3.5   NA    NA
5   NA    2     two

ID indicates individuals, X1 is a measurement taken once per individual, and X2 and X3 are character and string representations of the same treatment. E.g., for individual 2, X1= 7.6, and was subjected to treatment X2 = 30, 50, and 40 (which has associated other info, X3 = thirty, fifty, and forty)

I want it to end up looking like this, with only one row per individual, and values from multiple observations on the same individual added as new columns:

ID   X1     X2a   X3a       X2b   X3b     X2c  X3c
1    1.4    2     two       NA    NA      NA   NA
2    7.6    30    thirty    50    fifty   40   forty
3    5.6    40    forty     50    fifty   NA   NA
4    3.5    NA    NA        NA    NA      NA   NA
5    NA     2     two       NA    NA      NA   NA

I want to keep rows with X1=NA (like subject 5), because those rows have measurements of interest in other columns that are mostly excluded for the simplicity of my example. My actual dataset is pretty big (10,000s of rows, and maybe 50 columns), so a method that can efficiently handle a fairly big dataset would be a plus.

I've been playing with melt() and cast() from the reshape package, which I think can do what I want. But, I am getting an error requiring me to aggregate (Error: Aggregation requires fun.aggregate: length used as default), which I don't want to do. Suggestions for getting melt and cast to cooperate? or a better way to do this?

I've found a number of similar questions (for example, this one), but none seem to address my problem as they don't start with a similar data structure or they don't care which instances of the data end up matched with which...

I hate to be this guy, but it would help to know why you are doing this. Are you trying to plot data, or do a linear model, or what? Cause if any of those is the goal, the shape it is currently in might be suitable. — Andrew Barr, Feb 03 '14 at 22:08
@AndrewBarr, my goals are to compare measurements of X1 by various conditions of X2, (t-tests, followed up with linear or other appropriate models) and make corresponding plots. I want to avoid counting X1 measurements from the same individual (ID) twice, but want to be able to use each X2 condition as a category for testing. Though each individual can be in >1 X2 category, for the purposes of each t-test, individuals either meet a condition or they don't - e.g. if I compare X2=30 vs X2=NOT-30, I want individual 2 be in the 30 category and not the NOT-30 category — Brad, Feb 12 '14 at 18:50

score 4 · Accepted Answer · answered Feb 03 '14 at 22:09

Here's one option:

library(data.table)
dt = data.table(your_df)

# get number of columns first (6 here)
max.N = max(dt[, .N*ncol(.SD), by = list(ID, X1)]$V1)

# now construct the result by filling in appropriate # of NA's
dt[, as.list(c(t(.SD), rep(NA, max.N - .N*ncol(.SD)))), by = list(ID, X1)]
#   ID  X1 V1     V2 V3    V4 V5    V6
#1:  1 1.4  2    two NA    NA NA    NA
#2:  2 7.6 30 thirty 50 fifty 40 forty
#3:  3 5.6 40  forty 50 fifty NA    NA
#4:  4 3.5 NA     NA NA    NA NA    NA
#5:  5  NA  2    two NA    NA NA    NA

score 1 · Answer 2 · edited Jun 20 '20 at 09:12

As of version 1.9.6 (on CRAN 2015-09-19), data.table's dcast() function can cast multiple value.var columns simultaneously.

dcast(DT, ID + X1 ~ rowid(ID), value.var = c("X2", "X3"))

   ID  X1 X2_1 X2_2 X2_3   X3_1  X3_2  X3_3
1:  1 1.4    2   NA   NA    two    NA    NA
2:  2 7.6   30   50   40 thirty fifty forty
3:  3 5.6   40   50   NA  forty fifty    NA
4:  4 3.5   NA   NA   NA     NA    NA    NA
5:  5  NA    2   NA   NA    two    NA    NA

Data

library(data.table)
DT <- fread(
"ID  X1    X2    X3
  1   1.4   2     two
  2   7.6   30    thirty
  2   7.6   50    fifty
  2   7.6   40    forty
  3   5.6   40    forty
  3   5.6   50    fifty
  4   3.5   NA    NA
  5   NA    2     two"
)

Condensing multiple observations on the same individual into a single row, adding multiples as new columns

2 Answers2

Data

Linked