3

I have a dataset that looks like this:

phrase      wo1sp     wo2sp     wo3sp     wo1sc     wo2sc     wo3sc
hello       dan       mark      todd      10        5         4
hello       mark      dan       chris     8         9         4
goodbye     mark      dan       kev       2         4         10
what        kev       dan       mark      4         5         5

And I'd like to change it to something like this:

phrase      sp      sc
hello       dan     10 
hello       mark    5
hello       todd    4
hello       mark    8
hello       dan     9
hello       chris   4
goodbye     mark    2
goodbye     dan     4
goodbye     kev     10
what        kev     4
what        dan     5
what        mark    5

Many of the suggestions I've come across don't depend on the data columns being coupled with one another—so the suggestions end up losing information in things like gather calls. I asked this question yesterday.

I've solved it like this:

library("tidyverse")

test_set = tribble(~phrase,      ~wo1sp,     ~wo2sp,     ~wo3sp,     ~wo1sc,     ~wo2sc,     ~wo3sc,
                   "hello",       "dan",       "mark",      "todd",      10,        5,         4,
                   "goodbye",     "mark",      "dan",       "kev",       2,         4,         10,
                   "what",        "kev",       "dan",       "mark",      4,         5,         5,
                   "hello",       "mark",      "dan",       "mark",      4,         7,         10)

tmp_list <- list()
for (ii in 1:3) {
  selected_data <- test_set %>%
    group_by(phrase) %>%
    select(matches(paste("wo", ii, "|phrase", sep="")))
  names(selected_data) <- c("phrase", "sp", "sc")
  tmp_list[[ii]] <- selected_data
}
ds <- do.call(rbind, tmp_list)

Which gives me exactly what I want, but it feels... hacky. In R, I try to avoid loops and messing around with column names (until the final cleanup) as much as I can. Even the select statement feels funny. I'm confident there's a cleaner way of doing this, but I've spent a very long time with (mostly with tidyr) trying to figure out how, and have come up with nothing.

Does anyone have any ideas?

(The duplicated question seems to be the same, but the answer to this question is radically different [and way cooler])

AmagicalFishy
  • 1,249
  • 1
  • 12
  • 36

2 Answers2

7

data.table to the rescue..

library(data.table)
setDT(df)
melt(df, id.vars = "phrase", measure.vars = patterns(sp = "sp$", sc = "sc$"))
 #    phrase variable    sp sc
 #1:   hello        1   dan 10
 #2:   hello        1  mark  8
 #3: goodbye        1  mark  2
 #4:    what        1   kev  4
 #5:   hello        2  mark  5
 #6:   hello        2   dan  9
 #7: goodbye        2   dan  4
 #8:    what        2   dan  5
 #9:   hello        3  todd  4
#10:   hello        3 chris  4
#11: goodbye        3   kev 10
#12:    what        3  mark  5

The measure=patterns(sp = "sp$", sc = "sc$") allows you to specify groups of related-columns that follow a naming pattern, in this case all sp columns end with sp and the others with sc. Hence you get those two columns as output. The variable column is an additional information for you that tells you which original column of the column-group the data came from.

talat
  • 68,970
  • 21
  • 126
  • 157
2

The tidyverse solution (which is the package you are using).

library(tidyverse)

test_set2 <- test_set %>%
  mutate(ID = 1:n()) %>%
  gather(Col, Val, -phrase, -ID) %>%
  separate(Col, into = c("Col1", "Col2"), sep = 3) %>%
  spread(Col2, Val) %>%
  select(phrase, sp, sc) 

test_set2
# # A tibble: 12 x 3
#    phrase  sp    sc   
#    <chr>   <chr> <chr>
#  1 goodbye mark  2    
#  2 goodbye dan   4    
#  3 goodbye kev   10   
#  4 hello   dan   10   
#  5 hello   mark  5    
#  6 hello   todd  4    
#  7 hello   mark  4    
#  8 hello   dan   7    
#  9 hello   mark  10   
# 10 what    kev   4    
# 11 what    dan   5    
# 12 what    mark  5

In addition to the separate function, we can also use the extract function to separate the columns.

test_set2 <- test_set %>%
  mutate(ID = 1:n()) %>%
  gather(Col, Val, -phrase, -ID) %>%
  extract(Col, into = c("Col1", "Col2"), regex = "(^wo[0-9])(sc$|sp$)") %>%
  spread(Col2, Val) %>%
  select(phrase, sp, sc) 
www
  • 38,575
  • 12
  • 48
  • 84