
What is the difference between gather, reshape, cast, and similar functions? I know they are all helpful in transitioning between long and wide data, but I am having trouble using them. The documentation tends to use terms like "id" variables and "time" variables, but I am not sure what is what.

I have a dataframe like this:

data <- data.frame(id = c(rep("A", 10), rep("B", 10), rep("C", 10)),
                   val = 1:30)

I am trying to reformat it to look like this:

res <- data.frame(A = 1:10,
                  B = 11:20,
                  C = 21:30)

How could I most easily accomplish this? Any tips? I know this is an "easy" question, but I am stumped. Thanks in advance.

Michael
cgibbs_10
    Probably the easiest solution is `unstack` in base R, `unstack(x=data, val ~ id)`. – lmo Apr 15 '18 at 22:39
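
A minimal sketch of the `unstack` suggestion, using the data from the question. The name `wide` is only illustrative, and this assumes every id has the same number of values (otherwise `unstack` returns a list rather than a data frame).

```r
# Data from the question: three ids with ten values each
data <- data.frame(id = c(rep("A", 10), rep("B", 10), rep("C", 10)),
                   val = 1:30)

# Split val by id; equal group sizes yield a data frame with one column per id
wide <- unstack(data, val ~ id)
wide
```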

3 Answers


Please use the search function prior to posting. This has been asked a lot here on SO!

In the tidyverse you can do:

library(dplyr)
library(tidyr)
data %>%
    group_by(id) %>%
    mutate(n = row_number()) %>%   # label rows 1-10 within each id
    ungroup() %>%
    spread(id, val) %>%
    select(-n)
## A tibble: 10 x 3
#       A     B     C
#   <int> <int> <int>
# 1     1    11    21
# 2     2    12    22
# 3     3    13    23
# 4     4    14    24
# 5     5    15    25
# 6     6    16    26
# 7     7    17    27
# 8     8    18    28
# 9     9    19    29
#10    10    20    30

Comment: I suggest executing the above line by line to see what each command does. Also note that

data %>%
    spread(id, val)

will produce an error (see @neilfws' explanation in the comment).
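
For readers on tidyr 1.0.0 or later, where `spread` has been superseded by `pivot_wider`, the same idea can be sketched as follows. This is an illustration rather than part of the original answer; the helper column `row` and the result name `wide` are hypothetical names chosen here.

```r
library(dplyr)
library(tidyr)

# Data from the question
data <- data.frame(id = c(rep("A", 10), rep("B", 10), rep("C", 10)),
                   val = 1:30)

wide <- data %>%
    group_by(id) %>%
    mutate(row = row_number()) %>%   # unique row label within each id
    ungroup() %>%
    pivot_wider(names_from = id, values_from = val) %>%
    select(-row)                     # drop the helper column again
wide
```

As with `spread`, the helper column is what tells the function which values belong on the same output row.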

Maurits Evers
    We should explain why `group_by` and `mutate` are required. Without them you'll see `Error: Duplicate identifiers for rows`, because there's no unique identifier for the 10 occurrences of A, B, C. So we effectively "label" each id with values 1-10, then `spread` can work. – neilfws Apr 15 '18 at 22:22
  • @neilfws Added a brief comment. I still think this question should be closed as a dupe. Any explanation I give doesn't do justice to the extensive coverage/discussion in the dupe link. – Maurits Evers Apr 15 '18 at 22:30

The tidyr package is a replacement for the reshape and reshape2 packages.

Therefore, the tidyr functions spread() and gather() are replacements for reshape2::dcast() and reshape2::melt(), respectively.

To spread your data as requested, you'll need to add another column to specify the row numbers in the output data frame, as follows.

data <- data.frame(id = c(rep("A", 10), rep("B", 10), rep("C", 10)),
                   val = 1:30,
                   row = c(1:10, 1:10, 1:10))

library(tidyr)
data %>% spread(.,id,val)

...and the output:

> data %>% spread(.,id,val)
   row  A  B  C
1    1  1 11 21
2    2  2 12 22
3    3  3 13 23
4    4  4 14 24
5    5  5 15 25
6    6  6 16 26
7    7  7 17 27
8    8  8 18 28
9    9  9 19 29
10  10 10 20 30
> 

To drop the row variable, add the dplyr package and select() out the unwanted column.

library(tidyr)
library(dplyr)
data %>% spread(.,id,val) %>% select(-row)

...and the output:

> data %>% spread(.,id,val) %>% select(-row)
    A  B  C
1   1 11 21
2   2 12 22
3   3 13 23
4   4 14 24
5   5 15 25
6   6 16 26
7   7 17 27
8   8 18 28
9   9 19 29
10 10 20 30
>
Len Greski
    For the sake of completeness, optimised versions of `melt()` and `dcast()` are also available from the `data.table` package. These versions allow for melting/casting multiple value variables. – Uwe Apr 15 '18 at 22:44

All of these functions fundamentally do the same thing - they convert a data set from a wide format to a long format or vice versa. The differences are how they approach the task.

The reshape function is the base R method - it's been around forever. I find it to be cumbersome (I need to check the examples every time in order to use it), but it's perfectly functional.

If you start with a wide format, a simple example of going to a long format looks like this:

df_long <- reshape(df_wide,
  direction = "long",
  ids = 1:nrow(df_wide), # required, but not very informative
  times = colnames(df_wide), # required - the factor labels for the variable differentiating a measurement from column 2 versus column 3
  varying = 1:ncol(df_wide), # required - specify which columns need to be switched to long format
  v.names = "measurement", # optional - the name for the variable which will contain all the values of the variables being converted to long format
  timevar = "times" # optional - the name for the variable containing the factor (with levels defined in the times argument)
)

You can similarly go the other way, from a long format to a wide one: set direction = "wide", and the required arguments become optional, while the optional arguments (timevar, idvar and v.names) become required. (In theory, R can sometimes infer some of these arguments, but I've never had good luck with this. I treat them as required whether they are or not.)
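
Applied to the data in the question, a long-to-wide call might look like the sketch below. The helper column `row` and the result name `wide` are assumptions introduced for illustration; `reshape` needs some such column as `idvar` to know which values share an output row.

```r
# Data from the question
data <- data.frame(id = c(rep("A", 10), rep("B", 10), rep("C", 10)),
                   val = 1:30)

# Number the observations 1-10 within each id to serve as the row identifier
data$row <- ave(data$val, data$id, FUN = seq_along)

wide <- reshape(data,
  direction = "wide",
  idvar = "row",    # which observations end up on the same output row
  timevar = "id",   # the variable whose levels become the new columns
  v.names = "val"   # the variable holding the values to spread out
)

# reshape() names the new columns val.A, val.B, val.C; tidy them up
names(wide) <- sub("^val\\.", "", names(wide))
wide$row <- NULL
wide
```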

The gather/spread functions are a much simpler alternative. One big difference: it's two commands rather than one, so you don't have to worry about which arguments are relevant to each. I see that at least 2 answers have popped up describing how these functions work, so I won't repeat what they have said.

Melissa Key