0

[Still new to this, let me know if I need to provide any more information to help with my problem]

Trying to help a labmate with some data analysis, but the column names in the data compound several variables.

In his data, each row represents a mean value. each column is named mean.x.y.z, where x, y, & z, refer to different values of different variables. What we need to do is split each column name into its composite parts, and create separate columns for each of those.

for a more concrete example, if this is the data we were provided:

messy <- tibble(
  id = c('P1', 'P2', 'P3'),
  value.neutral.up = c(1.2, 1.3, 1.2),
  value.neutral.down = c(2.1, 3.1, 2.1),
  value.valid.up = c(1.2, 1.3, 1.2),
  value.valid.down = c(2.1, 3.1, 2.1),
  value.invalid.up = c(1.2, 1.3, 1.2),
  value.invalid.down = c(2.1, 3.1, 2.1)
)
messy
# A tibble: 3 x 7
  id    value.neutral.up value.neutral.down value.valid.up value.valid.down value.invalid.up value.invalid.down
  <chr>            <dbl>              <dbl>          <dbl>            <dbl>            <dbl>              <dbl>
1 P1                 1.2                2.1            1.2              2.1              1.2                2.1
2 P2                 1.3                3.1            1.3              3.1              1.3                3.1
3 P3                 1.2                2.1            1.2              2.1              1.2                2.1

Value refers to the numeric value that we want to extract, neutral/valid/invalid & up/down refer to levels of two predictor variables.

With that in mind, our end goal is something that looks like:

tidy <- tibble(
  id = c( rep('P1', 6), rep('P2', 6), rep('P3', 6)),
  cue = rep(c('neutral', 'valid', 'invalid'),6),
  direction = rep(c('up', 'down'), 9),
  value = rep(c(1.2, 1.3, 1.4, 2.1, 3.1, 4.1), 3) 
)

tidy
# A tibble: 18 x 4
   id    cue     direction value
   <chr> <chr>   <chr>     <dbl>
 1 P1    neutral up          1.2
 2 P1    valid   down        1.3
 3 P1    invalid up          1.4
 4 P1    neutral down        2.1
 5 P1    valid   up          3.1
 6 P1    invalid down        4.1
 7 P2    neutral up          1.2
 8 P2    valid   down        1.3
 9 P2    invalid up          1.4
10 P2    neutral down        2.1
11 P2    valid   up          3.1
12 P2    invalid down        4.1
13 P3    neutral up          1.2
14 P3    valid   down        1.3
15 P3    invalid up          1.4
16 P3    neutral down        2.1
17 P3    valid   up          3.1
18 P3    invalid down        4.1

I've used reshape2::gather() for this kind of problem before, but only in cases where column names didn't compound more than one variable name. That is, if the columns were value.neutral, value.valid, etc, that would be fine, but I don't know how to deal with the additional .up/down bit.

Any recommendations for how to handle this? Even if somebody could help me to better phrase this problem (in searchable terms) I'd be appreciative.

Thanks folks!

Brett
  • 45
  • 8
  • Try `messy %>% gather(key,value,-id) %>% separate(key, into = c('val','cue','dir')) %>% select(-val)`, note that probably will marked as a dup. – A. Suliman Aug 06 '19 at 20:11
  • 1
    Definitely a duplicate now that I understand the solution. – Brett Aug 06 '19 at 20:55
  • Followup question that I don't see addressed in the post you linked to: while most of my columns have 3 part names, some have four. Is there a way to account for this using separate? Or would it be easier to place them in another data frame, adjust, and reinsert? I don't mind doing the latter, but I wanted to see if there was a cleaner/more pithy way. – Brett Aug 07 '19 at 13:55
  • We can adapt Gregor's answer from [here](https://stackoverflow.com/questions/37731324/split-or-separate-uneven-unequal-strings-with-no-delimiter). define `maxchar` using `messy %>% gather(key,value,-id) -> messy1;maxchar = max(stringr::str_count(messy1$key, pattern = '\\.'))` then do `tidyr::separate(messy1, key, into = paste0("key", 1:(maxchar+1)), sep = '\\.')` – A. Suliman Aug 07 '19 at 14:34
  • Thanks friend. This would have worked right out of the box if there was some consistency to how the columns were named, but this is a solid start and useful to know for the future. – Brett Aug 07 '19 at 15:10
  • You're most welcome friend ;). [Here](https://stackoverflow.com/questions/51872267/split-columns-in-dataframe-with-na) the last scenario it's most unlikely but it's good to be aware of. – A. Suliman Aug 07 '19 at 16:41

1 Answers1

1
library(dplyr)
library(reshape2)
library(tidyr)

reshape2::melt(messy,id.vars="id") %>%
    tidyr::separate("variable",into = c("drop","cue","direction")) %>%
    select(-drop)
   id     cue direction value
1  P1 neutral        up   1.2
2  P2 neutral        up   1.3
3  P3 neutral        up   1.2
4  P1 neutral      down   2.1
5  P2 neutral      down   3.1
6  P3 neutral      down   2.1
7  P1   valid        up   1.2
8  P2   valid        up   1.3
9  P3   valid        up   1.2
10 P1   valid      down   2.1
11 P2   valid      down   3.1
12 P3   valid      down   2.1
13 P1 invalid        up   1.2
14 P2 invalid        up   1.3
15 P3 invalid        up   1.2
16 P1 invalid      down   2.1
17 P2 invalid      down   3.1
18 P3 invalid      down   2.1
yusuzech
  • 5,896
  • 1
  • 18
  • 33
  • Thank you for this, I did not know about separate(). Really appreciate the added detail of dropping the 'drop' column too! – Brett Aug 06 '19 at 20:37