Data frame from wide to long with multiple variables and ids R

Question

I have a dataframe with participants' judgments for two texts. Suppose each text has a correct answer and an identifier, and each text is judged multple times.

set.seed(123)
wide_df = data.frame('participant_id' = LETTERS[1:12]
                 , 'judgment_1' = round(rnorm(12)*100)
                 , 'correct_1' = round(rnorm(12)*100)
                 , 'text_id_1' = sample(1:12, 12, replace = F)
                 , 'judgment_2' = round(rnorm(12)*100)
                 , 'correct_2' = round(rnorm(12)*100)
                 , 'text_id_2' = sample(13:24, 12, replace = F)
                 )

So that:

   participant_id judgment_1 correct_1 text_id_1 judgment_2 correct_2 text_id_2
1               A        -56        40         4         43      -127        17
2               B        -23        11        10        -30       217        14
3               C        156       -56         1         90       121        22
4               D          7       179        12         88      -112        15
5               E         13        50         7         82       -40        13
...

I would want to convert this to the long format with the columns:

participant_id   text_id   judgment   correct
             A         4        -56        40
             A        17         43       127     
...

I found and followed the SO advice here:

wide_df %>% 
  gather(v, value, judgment_1:text_id_2) %>% 
  separate(v, c("var", "col")) %>% 
  arrange(participant_id) %>% 
  spread(col, value)

But that way of reshaping returns the error Error: Duplicate identifiers for rows (3, 6), (9, 12)

I think I do something conceptually wrong but can't quite find it. Where is my mistake? Thanks!

It is unclear how you wish to aggregate text_id_1 and text_id_2 into a single identifier "text_id". Similar with others. Do you want another identifier that is either 1 or 2 (call it "time") so that each observation has a participant_id and a time? — farnsy, Nov 14 '18 at 15:37
`text_id_1` and `text_id_2` are the identifiers belonging to judgment 1 and 2. So it can be one variable that belongs to each `case` just as judgment and correct. — ben_aaron, Nov 14 '18 at 15:40
Why are you gathering judgment_1 and Text_id_2 ? and Not Text_id_1 ? — user5249203, Nov 14 '18 at 15:42

score 3 · Answer 1 · answered Nov 14 '18 at 15:59

Here's a more dynamic way compared to my other answer. This does not require manually uniting the required columns but does rely on column name pattern.

wide_df %>% 
  gather(variable, value, -participant_id) %>% 
  mutate(
    variable = substr(variable, 1, nchar(variable)-2),
    rn = ave(1:length(participant_id), participant_id, variable, FUN = seq_along)
    ) %>% 
  spread(variable, value) %>% 
  select(-rn)

   participant_id correct judgment text_id
1               A      40      -56       4
2               A    -127       43      17
3               B      11      -23      10
4               B     217      -30      14
5               C     -56      156       1
6               C     121       90      22
7               D     179        7      12
8               D    -112       88      15
9               E      50       13       7
10              E     -40       82      13
11              F    -197      172      11
12              F     -47       69      19
13              G      70       46       9
14              G      78       55      24
15              H     -47     -127       2
16              H      -8       -6      20
17              I    -107      -69       8
18              I      25      -31      21
19              J     -22      -45       3
20              J      -3      -38      16
21              K    -103      122       5
22              K      -4      -69      23
23              L     -73       36       6
24              L     137      -21      18

score 2 · Accepted Answer · answered Nov 14 '18 at 16:06

Answer already exists here: https://stackoverflow.com/a/12466668/2371031

e.g.,

set.seed(123)
wide_df = data.frame('participant_id' = LETTERS[1:12]
                     , 'judgment_1' = round(rnorm(12)*100)
                     , 'correct_1' = round(rnorm(12)*100)
                     , 'text_id_1' = sample(1:12, 12, replace = F)
                     , 'judgment_2' = round(rnorm(12)*100)
                     , 'correct_2' = round(rnorm(12)*100)
                     , 'text_id_2' = sample(13:24, 12, replace = F)
)

dl <- reshape(data = wide_df, 
              idvar = "participant_id", 
              varying = list(judgment=c(2,5),correct=c(3,6),text_id=c(4,7)), 
              direction="long", 
              v.names = c("judgment","correct","text_id"),
              sep="_")

Result:

    participant_id time judgment correct text_id
A.1              A    1      -56      40       4
B.1              B    1      -23      11      10
C.1              C    1      156     -56       1
D.1              D    1        7     179      12
E.1              E    1       13      50       7
F.1              F    1      172    -197      11
G.1              G    1       46      70       9
H.1              H    1     -127     -47       2
I.1              I    1      -69    -107       8
J.1              J    1      -45     -22       3
K.1              K    1      122    -103       5
L.1              L    1       36     -73       6
A.2              A    2       43    -127      17
B.2              B    2      -30     217      14
C.2              C    2       90     121      22
D.2              D    2       88    -112      15
E.2              E    2       82     -40      13
F.2              F    2       69     -47      19
G.2              G    2       55      78      24
H.2              H    2       -6      -8      20
I.2              I    2      -31      25      21
J.2              J    2      -38      -3      16
K.2              K    2      -69      -4      23
L.2              L    2      -21     137      18

s_baldur · Answer 3 · 2018-11-14T16:01:27.873

Some playing around with data.table:

library(data.table)
superlong_df <- melt(wide_df, id.vars = "participant_id")
setDT(superlong_df)[, `:=`(varn = gsub(".*_(\\d)", "\\1", variable),
                           variable = gsub("_\\d$", "", variable))]
dcast(superlong_df, participant_id + varn ~ variable)[, !"varn"]

    participant_id correct judgment text_id
 1:              A      40      -56       4
 2:              A    -127       43      17
 3:              B      11      -23      10
 4:              B     217      -30      14
 5:              C     -56      156       1
 6:              C     121       90      22
 7:              D     179        7      12
 8:              D    -112       88      15
 9:              E      50       13       7
10:              E     -40       82      13
11:              F    -197      172      11
12:              F     -47       69      19
13:              G      70       46       9
14:              G      78       55      24
15:              H     -47     -127       2
16:              H      -8       -6      20
17:              I    -107      -69       8
18:              I      25      -31      21
19:              J     -22      -45       3
20:              J      -3      -38      16
21:              K    -103      122       5
22:              K      -4      -69      23
23:              L     -73       36       6
24:              L     137      -21      18
    participant_id correct judgment text_id

score 0 · Answer 4 · answered Nov 14 '18 at 15:43

Here's way with tidyr. Basically you'll do unite() + separate_rows(). unite to combine columns and separate_rows to get them in rows -

wide_df %>% 
  unite(text_id, text_id_1, text_id_2) %>% 
  unite(judgment, judgment_1, judgment_2) %>% 
  unite(correct, correct_1, correct_2) %>% 
  separate_rows(2:4, sep = "_")

   participant_id judgment correct text_id
1               A      -56      40       4
2               A       43    -127      17
3               B      -23      11      10
4               B      -30     217      14
5               C      156     -56       1
6               C       90     121      22
7               D        7     179      12
8               D       88    -112      15
9               E       13      50       7
10              E       82     -40      13
11              F      172    -197      11
12              F       69     -47      19
13              G       46      70       9
14              G       55      78      24
15              H     -127     -47       2
16              H       -6      -8      20
17              I      -69    -107       8
18              I      -31      25      21
19              J      -45     -22       3
20              J      -38      -3      16
21              K      122    -103       5
22              K      -69      -4      23
23              L       36     -73       6
24              L      -21     137      18

Data frame from wide to long with multiple variables and ids R

4 Answers4