I'm trying to figure out the arguments for gather in the tidyr package.

I looked at the documentation, and the syntax looks like:

gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)

There is an example in the help files:

stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

gather(stocks, stock, price, -time)

I'm curious about the last line:
gather(stocks, stock, price, -time)

Here, stocks is clearly the data we want to modify, which is fine.

I can see that stock and price name the key and value columns of a key-value pair, but how does the function decide which columns to select when it builds that pair? The original data frame looks like this:

time                  X           Y           Z
2009-01-01   1.10177950  -1.1926213  -7.4149618
2009-01-02   0.75578151  -4.3705737  -0.3117843
2009-01-03  -0.23823356  -1.3497319   3.8742654
2009-01-04   0.98744470  -4.2381224   0.7397038
2009-01-05   0.74139013  -2.5303960  -5.5197743

I don't see any indication that we should use any combination of X, Y or Z. When I'm using this function, I feel like I'm just choosing names for what I want the columns in my long formatted dataframe to be, and praying that gather magically works. Come to think of it, I feel the same way when I use melt.

Does gather look at the column's type? How does it map from wide to long?

EDIT: Great answer and great discussion below. Anyone else wanting more info on the philosophy and use of the tidyr package should definitely read this paper, although the vignette doesn't explain the syntax.

tumultous_rooster
  • The `-time` says to use all the columns except time. Another approach would be to use `gather(stocks, stock, value, X:Z)`, if you prefer to specify which columns should be "gathered". Or even `gather(stocks, stock, value, X, Y, Z)`. Essentially, this is more like using `melt` with the `measure.vars` argument instead of specifying the `id.vars` (`melt(stocks, measure.vars = c("X", "Y", "Z"))`). – A5C1D2H2I1M1N2O1R2T1 Jan 25 '15 at 05:57

1 Answer


In "tidyr", you specify the measure variables for gather in the ... argument. This is a little bit different conceptually from melt, where many examples (even many answers here on SO) would show the use of the id.vars argument (with the assumption that anything that is not specified as an ID is a measurement).

The ... argument can also take a column name prefixed with -, as in the example you have shown. This basically says to "gather all of the columns except for this one". Another shorthand in gather is to specify a range of columns with the colon, for example, gather(stocks, stock, price, X:Z).
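To make the column-selection rules above concrete, here is a minimal sketch that runs the three forms side by side on the stocks data from the question. The seed and the result names (by_exclusion, by_range, by_name) are my own additions for illustration; the gather calls themselves are the ones discussed above.

```r
library(tidyr)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

# Three ways of telling gather which columns to stack:
by_exclusion <- gather(stocks, stock, price, -time)    # everything except time
by_range     <- gather(stocks, stock, price, X:Z)      # a range of columns
by_name      <- gather(stocks, stock, price, X, Y, Z)  # columns listed one by one

# All three should select the same measure columns
identical(by_exclusion, by_range)
identical(by_range, by_name)

# 10 rows x 3 gathered columns = 30 rows; the columns are time, stock, price
dim(by_exclusion)
head(by_exclusion)
```

In other words, stock and price are only the names of the two new columns; which columns get stacked is decided entirely by what you pass in the ... argument.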

You can compare gather with melt by looking at the code for the function. Here are the first few lines:

> tidyr:::gather_.data.frame
function (data, key_col, value_col, gather_cols, na.rm = FALSE, 
    convert = FALSE) 
{
    data2 <- reshape2::melt(data, measure.vars = gather_cols, 
        variable.name = key_col, value.name = value_col, na.rm = na.rm)
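The correspondence with melt can also be checked from the outside, without relying on how gather is implemented in whichever tidyr version you have installed. The following is a rough sketch that assumes both tidyr and reshape2 are installed; the seed and the names gathered and melted are hypothetical and only there for illustration.

```r
library(tidyr)
library(reshape2)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

# gather: name the key and value columns, then pick the measure columns
gathered <- gather(stocks, stock, price, X:Z)

# melt: the same pieces, passed as explicit arguments
melted <- melt(stocks,
               measure.vars  = c("X", "Y", "Z"),
               variable.name = "stock",
               value.name    = "price")

# Same values in the same order; melt returns the key column as a factor,
# so it is converted to character before comparing
all.equal(gathered$price, melted$price)
identical(gathered$stock, as.character(melted$stock))
```

The mapping is key -> variable.name, value -> value.name, and ... -> measure.vars.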
A5C1D2H2I1M1N2O1R2T1
  • Hmm... It's funny that `gather` is just a `melt` wrapper. What was the point in creating it then? – David Arenburg Jan 25 '15 at 09:43
  • @DavidArenburg, I think that there's only one person who knows the answer to that. `spread` isn't just `dcast`, so perhaps this is just a template for now? It seems a little awkward to me that the overall philosophy between "reshape2" and "tidyr" diverge a fair amount, but perhaps that's why it's a totally different package.... – A5C1D2H2I1M1N2O1R2T1 Jan 25 '15 at 10:37
  • @DavidArenburg the point is that `gather()` is, in general, much easier for people to understand, and it's symmetric with `spread()` (unlike `melt()` and `cast()`) – hadley Jan 27 '15 at 01:26
  • @hadley I find `gather` way harder to understand than `dcast`. In `dcast` you were specifying *all* the variables of interest, while in `gather` you always need to keep in mind that the ID column should be ignored for some reason. [This question](http://stackoverflow.com/questions/26536251/comparing-gather-tidyr-to-melt-reshape2/) illustrates the confusion pretty well. `dcast` was also much more flexible than `spread` and had the `fun.aggregate` argument. It's just funny to me, the hype around `tidyr`, when it's both far less flexible/convenient and just a wrapper for `reshape2`... – David Arenburg Jan 27 '15 at 08:04
  • @DavidArenburg, I don't think your linked question is a great illustration. At some point, there's simply a different "philosophy" between the two approaches, and Tyler's question showed that he didn't understand the philosophy yet. Similarly, `spread` isn't designed to do everything that `dcast` does. Instead, it's expected to be a part of a "sentence" or a "paragraph" that describes how to manipulate the data, in which the aggregation step would be one of the earlier phrases. – A5C1D2H2I1M1N2O1R2T1 Jan 27 '15 at 11:51
  • @AnandaMahto I completely agree that `tidyr` is a different "philosophy". I'm just arguing against hadley's statement that it is *much easier for people to understand*, hence the linked question to illustrate my point. I'm only giving my opinion here; I do realize that it *is* possible that this is easier for certain people. – David Arenburg Jan 27 '15 at 11:54
  • @DavidArenburg all I can counter with is that my experience is that gather/spread fits most people's brains better, and anecdotally it's easier for most people to learn. The point of tidyr __is__ to do less than reshape2, but it's the right less. Have you read the tidy data paper? – hadley Jan 27 '15 at 16:32
  • @hadley, I once took a look, but it was long ago. Did you update it to match the new philosophy? – David Arenburg Jan 27 '15 at 17:10
  • @DavidArenburg update is in vignette: http://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html – hadley Jan 27 '15 at 19:48
  • @DavidArenburg coming late to this, but I think some of your difficulty with gather/spread may be that you already understand melt/dcast. A shift in understanding can sometimes be more difficult than a first learn. I'm teaching an introductory R course now, and presented both. On the homework I gave where they could use either, most of the students tended toward the tidyr functions. – Gregor Thomas Jan 29 '15 at 20:36
  • @Gregor, that's an interesting thought. I've actually encountered it myself too. The student course is also an interesting use case. I guess I'm already from the old generation then :). – David Arenburg Jan 29 '15 at 20:56