Tidy multivariate data in R

Question

I have a dataset with the following structure:

[Image: example of dataset]

The rows are participants in an experiment, and the columns are questions they answered. All the columns titled EC belong to one type of task, all those titled ART belong to another, etc.

After reading the table into R, how do I tidy the data such that all questions belonging to one type of task are saved as a single variable? I basically want each type of task (all answers that all participants gave for that task) to be saved as separate variables which I can later do statistical analysis on.

I understand that gather and separate might be useful commands for this, but I don't completely understand how to use them here and I don't completely understand their syntax.

For example:

`gather(data, key, value)` - I think that 'key' should refer to the title I gave the variable, and 'value' should refer to the fields where the values related to that variable are located? If so, what does 'data' refer to? I tried putting the name of the table in the 'data' field, but got an error saying 'Error: Invalid column specification'.

Can someone help?

pictures are usually reserved for plots. pasting the output of `dput(data)` would enable this question to be more reproducible — hrbrmstr, Oct 22 '16 at 18:11
Sorry, I tried to find out how to put the data in a clear table, but pasting it just made a mess of it. — Maria Gold, Oct 22 '16 at 18:32
`gather` is explained very nicely in this answer http://stackoverflow.com/a/26536296/4477364. — Joe, Oct 23 '16 at 08:57

hrbrmstr · Accepted Answer · 2016-10-22T18:57:33.793

There has to be a duplicate of this question somewhere, but let's simulate some data:

library(tidyr)
library(purrr)
library(dplyr)

This part just re-creates a data set like you seem to have. It's not necessary to understand this for the solution.

df <- map(1:16, ~sample(0:4, 10, replace=TRUE))
df <- as.data.frame(df)
df <- set_names(df, c(sprintf("EC%d", 1:4), sprintf("ART%d", 1:4), sprintf("IC%d", 1:4), sprintf("AQ%d", 1:4)))
# mutate() needs the data frame as its first argument; every row gets the same ID, "id10"
df <- mutate(df, participant=sprintf("id%d", 10))
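
For readers unfamiliar with the pieces used in the simulation, here is a small sketch of what `sprintf("EC%d", ...)` and the `map(1:16, ~...)` call each do on their own. The names `ec_names` and `cols` are only illustrative and are not part of the solution.

```r
library(purrr)

# sprintf() substitutes each integer into the %d placeholder
ec_names <- sprintf("EC%d", 1:4)
print(ec_names)  # "EC1" "EC2" "EC3" "EC4"

# map() evaluates the ~ formula once per element of 1:16, so the result is
# a list of 16 vectors (one per column), each with 10 random draws (one per row)
cols <- map(1:16, ~sample(0:4, 10, replace = TRUE))
print(length(cols))
print(lengths(cols))
```

So 16 is the number of question columns (4 tasks with 4 questions each in this simulation) and 10 is the number of participants. With real data this step is not needed at all, since the table is read from a file.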

Here's what df ends up looking like:

df
##    EC1 EC2 EC3 EC4 ART1 ART2 ART3 ART4 IC1 IC2 IC3 IC4 AQ1 AQ2 AQ3 AQ4 participant
## 1    4   2   1   4    2    2    3    1   4   2   0   4   3   0   4   2        id10
## 2    3   4   1   0    1    1    1    2   3   4   0   4   2   1   4   3        id10
## 3    4   2   3   2    0    1    3    4   4   1   2   4   0   1   0   4        id10
## 4    1   4   0   3    2    3    1    2   0   2   1   1   1   3   3   1        id10
## 5    2   3   1   1    2    4    1    0   3   0   3   3   0   1   4   2        id10
## 6    4   0   1   1    1    4    2    0   3   0   1   3   3   3   2   0        id10
## 7    3   1   1   1    4    1    1    0   0   2   1   4   3   2   2   3        id10
## 8    0   4   0   1    4    4    2    4   0   1   1   3   1   1   4   0        id10
## 9    0   0   4   4    0    1    0    3   1   0   2   3   4   4   1   0        id10
## 10   2   0   2   1    4    2    3    4   3   4   4   4   3   0   4   4        id10

That seems to match the format of your data.

If so, then, I think this is what you want:

df <- gather(df, answer, value, -participant)

head(df, 20)
##    participant answer value
## 1         id10    EC1     4
## 2         id10    EC1     3
## 3         id10    EC1     4
## 4         id10    EC1     1
## 5         id10    EC1     2
## 6         id10    EC1     4
## 7         id10    EC1     3
## 8         id10    EC1     0
## 9         id10    EC1     0
## 10        id10    EC1     2
## 11        id10    EC2     2
## 12        id10    EC2     4
## 13        id10    EC2     2
## 14        id10    EC2     4
## 15        id10    EC2     3
## 16        id10    EC2     0
## 17        id10    EC2     1
## 18        id10    EC2     4
## 19        id10    EC2     0
## 20        id10    EC2     0

You may or may not have an ID variable for the subject, but we don't know that since we really don't have your data.
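
To go one step further and treat each task type as its own variable, one option is to split the gathered column into a task label and a question number. This is only a sketch, assuming the column names follow the letters-then-digits pattern of the simulated data; the data frame `wide` and the names `long`, `task`, `item` and `task_means` are made up for illustration.

```r
library(tidyr)
library(dplyr)

# Small stand-in for the wide table: two participants, two tasks, two questions each
wide <- data.frame(
  participant = c("id1", "id2"),
  EC1 = c(4, 3), EC2 = c(2, 4),
  ART1 = c(2, 1), ART2 = c(2, 1)
)

# Wide to long, keeping participant as an identifier
long <- gather(wide, answer, value, -participant)

# Split e.g. "ART2" into task = "ART" and item = 2
long <- tidyr::extract(long, answer, into = c("task", "item"),
                       regex = "^([A-Za-z]+)([0-9]+)$", convert = TRUE)

# Example analysis: mean answer per participant and task
task_means <- long %>%
  group_by(participant, task) %>%
  summarise(mean_value = mean(value)) %>%
  ungroup()
print(task_means)
```

With the task in its own column, it can be turned into a factor with `long$task <- factor(long$task)` and used as a grouping variable in later analyses.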

edited Oct 22 '16 at 18:57

answered Oct 22 '16 at 18:19

hrbrmstr

Some basic questions: What does 'EC%d' mean? I mean the %d part. another basic thing - what exactly does '%>%' mean? #rbeginner – Maria Gold Oct 22 '16 at 18:40
And another quick one - map(1:16, ~sample(0:4, 10, replace=TRUE)) %>% is it 16 because you have 4 variables and each one has 4 columns? My actual data set has more than 4 columns, should I add all columns up and put that number instead of 16? With 10, I assume that should be the number of participants (or rows) I have, right? – Maria Gold Oct 22 '16 at 19:00
oh, sorry, I may have misunderstood what you did. Is the first set of commands just you re-creating the data I was referring to, not you tidying it? – Maria Gold Oct 22 '16 at 19:02
Ok, so when you do: df <- gather(df, answer, value, -participant) how does R know what to put in the 'answer', 'value' and 'participant' columns? Well, the 'participant' column was named the same before, but what about the other two? – Maria Gold Oct 22 '16 at 19:12
it's going to take all the columns, unless you exclude one or more of them, and gather them into a single one. – hrbrmstr Oct 22 '16 at 19:13
So the 'answer' only takes the top column, and then value takes the rest? – Maria Gold Oct 22 '16 at 19:15
right, the column names are going to be in `answer` and the values associated with them will be in `value`. there will be as many rows per participant as there are columns. – hrbrmstr Oct 22 '16 at 19:18
Ok, thanks a lot for your patience so far. Now, can 'df', 'answer' and 'value' be replaced with other words? or are these fixed expressions? And once I tidy the table, how do I set each type of task as a variable? Is it something like: df$EC= as.factor(df$EC) ? I'm not sure what 'entities' the $ sign can refer to exactly. – Maria Gold Oct 22 '16 at 19:23
I think that if I need to have these tasks as factors, they need to actually be column headers, right? How would change the tidy table you made to get there? – Maria Gold Oct 22 '16 at 19:40
Excluding the 'participant' column worked, but when I try excluding additional columns, I get an error saying 'Error: Unexpected symbol...' can I exclude more than one column? I tried putting a dash on the left of each column name I wish to exclude, or only before the first one and separating the rest with commas, I get the same error no matter what. – Maria Gold Oct 23 '16 at 21:27
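
On the last two points raised in the comments, here is a minimal sketch, assuming a hypothetical table with a second identifier column called `age`. The key and value names (`question` and `score` here) are arbitrary labels, and each excluded column takes its own minus sign, separated by commas.

```r
library(tidyr)

# Hypothetical wide table with two columns that should not be gathered
wide <- data.frame(
  participant = c("id1", "id2"),
  age = c(31, 27),
  EC1 = c(4, 3),
  ART1 = c(2, 1)
)

# "question" and "score" are free choices for the key and value column names
long <- gather(wide, question, score, -participant, -age)
print(long)

# Alternative: list the columns to gather instead of the ones to skip
long2 <- gather(wide, question, score, EC1, ART1)
```

An "unexpected symbol" message is a parse error raised before `gather` even runs, so it may point to a missing comma or to a column name containing spaces, which would need to be wrapped in backticks.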