
I have data with the following column headers:

Sheet Number, Year, Term, Class, Debate #, Role in Debate, Gender of Evaluator, Person #1 Clarity, Person #1 Effort, Person #1 Gender, Person #1 Origin, Debater Number, Person #2 Clarity, Person #2 Effort, Person #2 Gender, Person #2 Origin, Debater Number, Person #3 Clarity, Person #3 Effort, Person #3 Gender, Person #3 Origin, Debater Number, Person #4 Clarity, Person #4 Effort, Person #4 Gender, Person #4 Origin, Debater Number, Person #5 Clarity, Person #5 Effort, Person #5 Gender, Person #5 Origin, Debater Number, Person #6 Clarity, Person #6 Effort, Person #6 Gender, Person #6 Origin, Debater Number, Person #7 Clarity, Person #7 Effort, Person #7 Gender, Person #7 Origin, Debater Number, Person #8 Clarity, Person #8 Effort, Person #8 Gender, Person #8 Origin, Debater Number, Learned from Team 1, Learned from Team 2, Who won?, Sheet all 10s?, Evaluator Class Year

and I'd like to transform it into the format with the following column headers:

Sheet Number, Year, Term, Class, Debate #, Gender of Evaluator, Evaluator Class Year, Role in Debate, Debate Team Member #, Debater Number, Gender of Debate Team Member, Origin of Debate Team Member, Clarity of Debate Team Member, Effort of Debate Team Member, Learned from Team 1, Learned from Team 2, Who won?, Sheet all 10s?, =1 if Gender of Evaluator==Gender of Debater

The main difference between the two is that in the first format, each sheet number has 5-8 numbered 'Persons' associated with it. In the second format, each sheet number has a single person associated with it (so each sheet number appears multiple times and the data is "expanded").

How might I accomplish this in R? I've been trying to use the 'reshape' package. Thanks!

  • Welcome to StackOverflow! Especially with the "width" of this type of data, I urge you to make a [reproducible minimal working example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). In this case, generalize the problem to a handful of columns so you can get the concept across without flooding us. (Additionally, providing a representative data structure is very helpful.) – r2evans Jan 03 '15 at 01:54
  • So the original format looks something like `Sheet Number, Person #1 Clarity, Person #1 Effort, Person #1 Gender, Person #1 Origin, Debater Number, Person #2 Clarity, Person #2 Effort, Person #2 Gender, Person #2 Origin, Debater Number, Person #3 Clarity, Person #3 Effort, Person #3 Gender, Person #3 Origin, Debater Number` and I'd like it to be something like `Sheet Number, Debate Team Member #, Debater Number, Gender of Debate Team Member, Origin of Debate Team Member, Clarity of Debate Team Member, Effort of Debate Team Member` (i.e. with each Sheet Number associated with just one person) – Political Science Gurl Jan 03 '15 at 02:06
  • That's not a minimal reproducible example. I suggest you: 1. make a small dataset; 2. post the R code you ran against that dataset and what it produced; 3. post what you expected it to produce. Then we should be able to help you. – James King Jan 03 '15 at 02:14

2 Answers


(I took this opportunity to learn a little about tidyr, and I'm glad I did.)

As @JamesKing suggested, what you provided isn't the best MWE, so I've created some data with a similar structure. I think it all applies to your case, though, so with some explanation you should be able to translate it to your data. Having said that, since it appears that you are starting from an Excel spreadsheet, it would be worth adopting a column-naming convention that simplifies the gathering and separating of the data.

My data:

set.seed(1)
n <- 5
dat <- data.frame(
  sheetNum = 1:n,
  year = sample(2000:2025, size = n),
  roleInDebate = sample(letters, size = n, replace = TRUE),
  Clarity.1 = sample(10, size = n, replace = TRUE),
  Effort.1 = sample(10, size = n, replace = TRUE),
  Clarity.2 = sample(10, size = n, replace = TRUE),
  Effort.2 = sample(10, size = n, replace = TRUE),
  Clarity.3 = sample(10, size = n, replace = TRUE),
  Effort.3 = sample(10, size = n, replace = TRUE))
dat
#   sheetNum year roleInDebate Clarity.1 Effort.1 Clarity.2 Effort.2 Clarity.3 Effort.3
# 1        1 2006            x         3        5        10        4         5        7
# 2        2 2009            y         2        8         3        1         6        8
# 3        3 2013            r         7       10         7        4         5        2
# 4        4 2020            q         4        4         2        9         2        8
# 5        5 2004            b         8        8         3        4         9        5

The types of data:

  • Static columns: sheetNum, year, roleInDebate. This data is not used anywhere else and will be copied to each row for each person. Nothing is gathered, separated, or spread based on these columns.

  • The remainder are columns with data embedded in the column name. By that I mean that Clarity.1 has the datum 1 stored within it, and it will need to be separated out smartly. Though I only have two columns per person, this easily translates to more.
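As a toy sketch of that second point (not the full solution), tidyr's separate can split such a column name on the dot; the toy frame here is invented purely for illustration:

```r
library(tidyr)

# Hypothetical toy frame holding just the kind of names described above
toy <- data.frame(var = c("Clarity.1", "Effort.2"))

# Split on the literal dot into the skill and the person number
separate(toy, var, into = c("skill", "person"), sep = "\\.")
#     skill person
# 1 Clarity      1
# 2  Effort      2
```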

Bottom Line, Up Front

(If you are not familiar with the %>% infix operator from dplyr and magrittr, I encourage you to research it elsewhere. It's handy and pivotal to understanding this suggested solution.)
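As a quick illustration (using the built-in mtcars data), x %>% f(y) is just another way of writing f(x, y):

```r
library(magrittr)  # provides %>%; dplyr re-exports it

head(mtcars, 2)     # the usual call
mtcars %>% head(2)  # the same call, read left to right
```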

And now for the solution, quite simply:

library(tidyr)
library(dplyr)
dat %>%
  gather(var, val, -sheetNum, -year, -roleInDebate) %>%
  separate(var, c('skill', 'person'), '\\.') %>%
  spread(skill, val)
#    sheetNum year roleInDebate person Clarity Effort
# 1         1 2006            x      1       3      5
# 2         1 2006            x      2      10      4
# 3         1 2006            x      3       5      7
# 4         2 2009            y      1       2      8
# 5         2 2009            y      2       3      1
# 6         2 2009            y      3       6      8
# 7         3 2013            r      1       7     10
# 8         3 2013            r      2       7      4
# 9         3 2013            r      3       5      2
# 10        4 2020            q      1       4      4
# 11        4 2020            q      2       2      9
# 12        4 2020            q      3       2      8
# 13        5 2004            b      1       8      8
# 14        5 2004            b      2       3      4
# 15        5 2004            b      3       9      5

Breaking It Down

To see what is happening, let's step through this. The gather step simply combines unmentioned columns into a key/value pair of columns, like this:

dat %>% gather(var, val, -sheetNum, -year, -roleInDebate) %>% head()
#   sheetNum year roleInDebate       var val
# 1        1 2006            x Clarity.1   3
# 2        2 2009            y Clarity.1   2
# 3        3 2013            r Clarity.1   7
# 4        4 2020            q Clarity.1   4
# 5        5 2004            b Clarity.1   8
# 6        1 2006            x  Effort.1   5

Notice how the columns I listed with a - prefix (excluding them from the gathering) are kept verbatim. Next, we need to split up (or separate) the var column:

dat %>%
  gather(var, val, -sheetNum, -year, -roleInDebate) %>%
  separate(var, c('skill', 'person'), '\\.') %>% head()
#   sheetNum year roleInDebate   skill person val
# 1        1 2006            x Clarity      1   3
# 2        2 2009            y Clarity      1   2
# 3        3 2013            r Clarity      1   7
# 4        4 2020            q Clarity      1   4
# 5        5 2004            b Clarity      1   8
# 6        1 2006            x  Effort      1   5

Not a lot happened here, but it's essential for the next step: widening the data, or spreading it from a new key/value pair of columns (now skill and val), which creates the new columns named Clarity and Effort, as we saw in the solution above.

Hope this helps.

BTW: for a good MWE, it is often recommended to provide the output from dput(dat) where dat is a small but representative data structure that helps us to understand the starting point and your intended output. Here, a small data.frame for both would have been appropriate.
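For instance, here is a minimal frame and the dput() output you could paste into a question (exact formatting may vary slightly by R version):

```r
# A tiny, invented example frame
mini <- data.frame(sheetNum = 1:2, Clarity.1 = c(3, 2))

# dput() prints code that reconstructs the object exactly
dput(mini)
# structure(list(sheetNum = 1:2, Clarity.1 = c(3, 2)),
#           class = "data.frame", row.names = c(NA, -2L))
```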

r2evans

Assuming that @r2evans's sample data is somewhat representative of your problem, here are a few other options to consider.


Option 1: Base R's reshape

Oddly enough, the function named precisely after the task you describe, reshape(), seems to be the "black sheep" among reshaping tools. Once it's part of your regular toolkit, though, it's not too difficult to get the hang of.

For this problem, an approach might be:

reshape(dat, direction = "long", idvar = 1:3, varying = 4:ncol(dat), sep = ".")

My concerns with reshape() are not about its syntax, but that (1) it can't handle unbalanced wide-to-long conversions (for example, if you had 3 "Clarity" columns but only 2 "Effort" columns), and (2) it can be painfully slow when you start dealing with lots of rows or lots of columns to be reshaped. As such, I wrote merged.stack.
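On a small balanced frame (a toy version of the example data, values invented for illustration), that call looks like this; note that reshape() names the person column "time" by default:

```r
# Toy balanced frame shaped like the example data
dat2 <- data.frame(sheetNum = 1:2, year = c(2006, 2009), roleInDebate = c("x", "y"),
                   Clarity.1 = c(3, 2), Effort.1 = c(5, 8),
                   Clarity.2 = c(10, 3), Effort.2 = c(4, 1))

# Wide to long: one row per sheet per person, with Clarity and Effort columns
long <- reshape(dat2, direction = "long", idvar = 1:3,
                varying = 4:ncol(dat2), sep = ".")
long  # 4 rows; the 'time' column holds the person number
```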


Option 2: merged.stack

I wrote merged.stack as part of my "splitstackshape" package to handle reshaping tasks similar to what reshape(., direction = "long", ...) does (which is different, for instance, from what melt from "reshape"/"reshape2" and, subsequently, gather from "tidyr" do). I also wanted to simplify selecting the variables of interest by identifying just the variable "stubs" (in this case, "Clarity" and "Effort").

As mentioned before, merged.stack was also designed to be fast.

library(splitstackshape)
merged.stack(dat, var.stubs = c("Clarity", "Effort"), sep = ".")
#     sheetNum year roleInDebate .time_1 Clarity Effort
#  1:        1 2006            x       1       3      5
#  2:        1 2006            x       2      10      4
#  3:        1 2006            x       3       5      7
#  4:        2 2009            y       1       2      8
#  5:        2 2009            y       2       3      1
#  6:        2 2009            y       3       6      8
#  7:        3 2013            r       1       7     10
#  8:        3 2013            r       2       7      4
#  9:        3 2013            r       3       5      2
# 10:        4 2020            q       1       4      4
# 11:        4 2020            q       2       2      9
# 12:        4 2020            q       3       2      8
# 13:        5 2004            b       1       8      8
# 14:        5 2004            b       2       3      4
# 15:        5 2004            b       3       9      5

Option 3: Wait for melt from "data.table" version 1.9.8

OK, this may not really be an option yet (though the "data.table" developers work fast, so who knows), but in version 1.9.8 of "data.table" you would be able to melt specified lists of columns. See this issue for more details. Or, if you're more adventurous, install the 1.9.8 branch and try it out right away :-)

Eventually, this might make merged.stack redundant, since it will have similar functionality but even better speed (from what I can see in a few trials I've done).
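In current releases of data.table, melting multiple sets of columns at once is available through patterns() in measure.vars; a sketch on a toy frame (values invented for illustration):

```r
library(data.table)

# Toy frame with two stubs, two persons each
dt <- data.table(sheetNum = 1:2,
                 Clarity.1 = c(3, 2), Effort.1 = c(5, 8),
                 Clarity.2 = c(10, 3), Effort.2 = c(4, 1))

# Melt both stubs in one pass; 'variable' holds the person number
melt(dt, measure.vars = patterns("^Clarity", "^Effort"),
     value.name = c("Clarity", "Effort"))
```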


Update -- benchmarks

I've tested just merged.stack and @r2evans's approach above. I didn't test reshape() because I was afraid it would slow my system to a crawl, and I didn't test melt from "data.table" because it would be better to wait for a production release.

Here's some sample data:

set.seed(1)
n <- 100000
r <- 6
dat <- data.frame(
  sheetNum = 1:n,
  year = sample(2000:2025, size = n, TRUE),
  roleInDebate = sample(letters, size = n, replace = TRUE),
  matrix(sample(10, n * r * 2, TRUE), nrow = n, 
         dimnames = list(NULL, paste(c("Clarity", "Effort"), 
                                     rep(seq_len(r), each = 2), 
                                     sep = "."))))

Here are the two functions tested:

r2evans <- function() {
  dat %>%
    gather(var, val, -sheetNum, -year, -roleInDebate) %>%
    separate(var, c('skill', 'person'), '\\.') %>%
    spread(skill, val)
}

ananda <- function() {
  merged.stack(dat, var.stubs = c("Clarity", "Effort"), sep = ".")
}

Here are the results on 10 runs:

library(microbenchmark)
microbenchmark(r2evans(), ananda(), times = 10)
# Unit: milliseconds
#       expr       min        lq      mean    median        uq       max neval
#  r2evans() 3514.0961 3603.7102 3839.6097 3713.6705 3959.5320 4380.4601    10
#   ananda()  320.5602  336.2396  363.7165  367.3344  386.3064  417.7994    10

And some verification that the output is the same:

out1 <- r2evans()
out2 <- ananda()

library(compare)
compare(out1, out2, allowAll = TRUE)
# TRUE
#   renamed
#   dropped names
#   dropped attributes
A5C1D2H2I1M1N2O1R2T1
  • @r2evans, I see the wink, but FYI, it's not really about showing off, especially since I really do think my function will be made redundant by the work Arun's doing on `melt` for "data.table". The standard approach using "reshape2" tools (similar to "dplyr" + "tidyr") is to smash everything into a long skinny dataset and reshape from there--ignoring what the column types are--and then spread them back out to a wide form. That seems unnecessary. `merged.stack` tries to avoid that, as does the "data.table" implementation of `melt`. This could be a huge deal with bigger datasets. – A5C1D2H2I1M1N2O1R2T1 Jan 03 '15 at 06:18
  • 1
    To understand what I mean by "ignoring what the column types are", see [this comment](https://github.com/Rdatatable/data.table/issues/828#issuecomment-62208636) where there are multiple column types being melted at once. `spread` resorts to using `type.convert` which would also add to the processing time. (I also use `type.convert` in `cSplit` and related functions in "splitstackshape".) – A5C1D2H2I1M1N2O1R2T1 Jan 03 '15 at 06:19
  • Point taken, but neither was I competing nor do I think this specific problem was about performance as much as simplifying data stored in a human-maintained spreadsheet. I often prefer performance-efficient code and haven't been playing with either of these packages sufficiently to "feel" the performance hits. Thanks for the pointers, and this is a great reminder that "elegant" is not always faster, and "base" has already been vetted fairly well for speed. – r2evans Jan 03 '15 at 06:23