Convert R dataframe from long to wide format, but with unequal group sizes, for use with qcc

Question

I would like to convert a dataframe from long format to a wide format, but with unequal group sizes.

The eventual use will be in 'qcc', which requires a data frame or a matrix with each row consisting of one group, using NA's in groups which have fewer samples.

The following code will create an example dataset, as well as show manual conversion to the desired format.

# This is an example of the initial data that I have
# * 10 sample measurements, over 3 groups with 3, 2, and 5 elements respectively
x <- rnorm(10)
x_df <- data.frame( time = c( rep('2001 Q1',3), rep('2001 Q2',2), rep('2001 Q3',5) ), measure = x )
x_df

# This is a manual conversion into the desired format
x_pad <- c( x[1:3], NA, NA, x[4:5], NA, NA, NA, x[6:10] )
x_matrix <- matrix( x_pad, nrow = 3, ncol = 5, byrow = TRUE, dimnames = list(c('2001 Q1','2001 Q2','2001 Q3')) )
x_matrix # desired format

# An example of how it will be used
library(qcc)
plot(qcc(x_matrix, type = 'xbar', plot = FALSE))

So, I'd like to convert this:

      time     measure
1  2001 Q1  0.14680685
2  2001 Q1  0.53593193
3  2001 Q1  0.56097974
4  2001 Q2 -1.48102689
5  2001 Q2  0.18150972
6  2001 Q3  1.72018147
7  2001 Q3 -0.08480855
8  2001 Q3 -2.23208877
9  2001 Q3 -1.15269107
10 2001 Q3  0.57975023

... to this ...

              [,1]        [,2]       [,3]      [,4]      [,5]
2001 Q1  0.1468068  0.53593193  0.5609797        NA        NA
2001 Q2 -1.4810269  0.18150972         NA        NA        NA
2001 Q3  1.7201815 -0.08480855 -2.2320888 -1.152691 0.5797502

There is probably an easy way (perhaps some usage of reshape or reshape2 casting that I'm not familiar with?), but a bunch of searching hasn't helped me so far.

Thanks for any help!

==========

From one of the solutions below, the following will generate the final qcc xbar plot, including group labels:

library(splitstackshape)
out_df <- dcast( getanID( x_df, 'time' ), time~.id, value.var='measure' )
qcc( out_df[,-1], type = 'xbar', labels = out_df[,1] )

score 7 · Answer 1 · answered Jul 29 '15 at 12:16

7

You'll need an intermediate variable that gives a "within-time" id. You can create it and reshape like this

library(tidyr)
library(dplyr)

group_by(X, time) %>%
  mutate(seq = 1:n()) %>%
  ungroup() %>%
  spread(seq, measure)

answered Jul 29 '15 at 12:16

Benjamin

16,897
6
45
65

There is also a `row_number()` that can be handy as well. – akrun Jul 29 '15 at 12:55
This saved me! I had no idea I had to use `ungroup()` inbetween `group_by()` and `spread()`. Do you know why that is necessary? – Leo Aug 23 '17 at 11:33
Honestly, I don't know if you _have_ to. I just have an obsessive habit of ungrouping my data unless I have a reason to keep in grouped in subsequent commands. I get the same results on this sample of data whether I ungroup or not. – Benjamin Aug 23 '17 at 18:19

akrun · Accepted Answer · 2018-07-27T17:04:57.757

7

You can create a sequence column ('.id') using getanID from splitstackshape and use dcast from data.table to convert the long format to wide format. The output of splitstackshape is a data.table. When we load splitstackshape, data.table will also be loaded. So, if you already have the devel version of data.table, then the dcast from data.table can be used as well.

library(splitstackshape)
dcast(getanID(df1, 'time'), time~.id, value.var='measure')
#     time          1           2          3         4         5
#1: 2001 Q1  0.1468068  0.53593193  0.5609797        NA        NA
#2: 2001 Q2 -1.4810269  0.18150972         NA        NA        NA
#3: 2001 Q3  1.7201815 -0.08480855 -2.2320888 -1.152691 0.5797502

Update

As @snoram mentioned in the comments, function rowid from data.table makes it easier to use just data.table alone

library(data.table)
dcast(setDT(df1), time ~ rowid(time), value.var = "measure")

edited Jul 27 '18 at 17:04

answered Jul 29 '15 at 12:20

akrun

874,273
37
540
662

I like this solution in particular, because of how concise it is and because it keeps it as a dataframe. 'qcc' doesn't seem to like data tables very much. Not sure if it's proper StackOverflow etiquette, but I will edit my question to include how to use the above solution in the actuall qcc plotting as well, in case anyone finds this question looking for qcc solutions. – jhchou Jul 29 '15 at 12:54
@JChou If you want to convert the output to `data.frame`, you can use `setDF(res)` where `res <- dcast(getanID(...` – akrun Jul 29 '15 at 12:56
Thank you, awesome! I was doing a whole bunch of as.data.frame() conversions on the data tables. To be honest, I just need to learn about data tables! (I'm relatively new to R in general). – jhchou Jul 29 '15 at 12:59
@JChou I think the `setDF/setDT` functions were introduced in v1.9.3 or so. The devel version have many interesting functions including `dcast` to reshape multiple value columns and `melt` to convert to long with multiple measure columns. – akrun Jul 29 '15 at 13:01
3

I believe that in 2018 one would do: `dcast(x_df, time ~ rowid(time), measure.vars = "measure")`, no need for `splitstackshape`. – s_baldur Jul 27 '18 at 14:28
1

Thank you! This saved me from doing it in a very filthy way. – psyguy Mar 24 '21 at 10:50
I posted a similar question https://stackoverflow.com/questions/70128868/combine-long-format-data-frames-with-different-length-and-convert-to-wide-format but stumbled upon your answer. So thank you! However, I do wonder how to also group by `id` if it contains many individuals, as is the case in my dataset – cliu Nov 26 '21 at 20:02

score 5 · Answer 3 · edited Jul 29 '15 at 12:43

5

Another splitstackshape approach

cSplit(setDT(df)[, toString(measure), by='time'], 'V1', ',')

#      time       V1_1        V1_2       V1_3      V1_4      V1_5
#1: 2001 Q1  0.1468068  0.53593193  0.5609797        NA        NA
#2: 2001 Q2 -1.4810269  0.18150972         NA        NA        NA
#3: 2001 Q3  1.7201815 -0.08480855 -2.2320888 -1.152691 0.5797502

Or using the devel version of data.table a similar approach after pasting together the 'measure' by the grouping column 'time' would be using tstrsplit to split the 'V1' column generated from toString(measure).

 setDT(df)[, toString(measure), by ='time'][, c(list(time), tstrsplit(V1, ', '))]

Also, we can add type.convert=TRUE in tstrsplit to convert the class of the split columns. By default it is FALSE.

edited Jul 29 '15 at 12:43

akrun

874,273
37
540
662

answered Jul 29 '15 at 12:28

Veerendra Gadekar

4,452
19
24

1

That is a cute way. I didn't think about this. – akrun Jul 29 '15 at 12:30
Thanks @akrun :) I think I got to know this way from one of your old answers – Veerendra Gadekar Jul 29 '15 at 12:31
1

Using the devel version, you can also use `tstrsplit`. Something like `setDT(df1)[, toString(measure), by ='time'][, c(list(time), tstrsplit(V1, ', '))]` – akrun Jul 29 '15 at 12:36
Your solution is more compact though. I will add that. – akrun Jul 29 '15 at 12:42

Convert R dataframe from long to wide format, but with unequal group sizes, for use with qcc

3 Answers3

Update

Linked

Related