1

I have a dataset that looks like this:

phrase      wo1sp     wo2sp     wo3sp     wo1sc     wo2sc     wo3sc
hello       dan       mark      todd      10        5         4
hello       mark      dan       chris     8         9         4
goodbye     mark      dan       kev       2         4         10
what        kev       dan       mark      4         5         5

And I'd like to change it to something like this:

phrase      sp      sc
hello       dan     10 
hello       mark    5
hello       todd    4
hello       mark    8
hello       dan     9
hello       chris   4
goodbye     mark    2
goodbye     dan     4
goodbye     kev     10
what        kev     4
what        dan     5
what        mark    5

So, I know the first thing to do here is group_by(phrase). What I'm not sure about is how to associate sp1 with sc1, sp2 with sc2, etc. and make those each into their own rows. I've seen some vaguely similar things using reshape and tidy, but they don't depend on there being coupled columns. I would basically just like to collapse the numbers in the column names.

I have a request: When you answer, would you mind explaining what the code itself does? Many of the things I've searched on StackExchange present a seemingly esoteric solution with no explanation of what's going on.

AmagicalFishy
  • 1,249
  • 1
  • 12
  • 36

1 Answers1

1
library("tidyverse")

test_set = tribble(~phrase,      ~wo1sp,     ~wo2sp,     ~wo3sp,     ~wo1sc,     ~wo2sc,     ~wo3sc,
                   "hello",       "dan",       "mark",      "todd",      10,        5,         4,
                   "goodbye",     "mark",      "dan",       "kev",       2,         4,         10,
                   "what",        "kev",       "dan",       "mark",      4,         5,         5)

test_set %>% 
  gather(key = col, value = val, -phrase) %>% 
  separate(col = col, into = c("num", "suffix"), sep = 3) %>% 
  spread(key = suffix, value = val) %>% 
  mutate(sc = as.numeric(sc)) %>% 
  select(-num)

Edit: I guess it's not necessary to split col into three columns, could just do sep = 3

svenhalvorson
  • 1,090
  • 8
  • 21
  • 1
    Probably the easiest way to understand this is to run the first two lines, then the first three, ect. What we're doing is combining all the values into one column, then we split up the former column names into wo#(sp/sc), then we spread the values back into columns. When we call the spread function, we will associate all the rows that have the same 'num' values, thus fusing them into a single row. – svenhalvorson Dec 18 '18 at 21:16
  • Ah, my attempt at simplifying my dataset has failed, unfortunately (I've edited my question to reflect this). If you extend the dataset a little, this solution doesn't work—it only works in the specific case where there is one row for each phrase (we lose information in the `gather` call). You've put me on the right track, though; I'm messing around with things currently. – AmagicalFishy Dec 19 '18 at 15:48
  • Okay @AmagicalFishy. Another idea I had that would deal with the duplicated values would be to `gather` all the sp columns in one data set and all the sc columns in another. You could then join them together on phrase and the numerical portion of the column name. – svenhalvorson Dec 19 '18 at 20:24
  • If we did that, we'd still lose information though, right? Unless the order of data is maintained in a way that everything would be put back together nicely – AmagicalFishy Dec 20 '18 at 03:28