1

This is a question for all the Tidyverse experts out there. I have a dataset with lots of different classes (datettime, integer, factor, etc.) and want to use tidyr to gather multiple variables at the same time. In the reproducible example below I would like to gather time_, factor_ and integer_ at once, while id and gender remain untouched.

I am looking for the current best practice solution using any of the Tidyverse functions.

(I'd prefer if the solution isn't too "hacky" as I have a dataset with dozens of different key variables and around five hundred thousand rows).

Example data:

library("tidyverse")
data <- tibble(
  id = c(1, 2, 3),
  gender = factor(c("Male", "Female", "Female")),
  time1 = as.POSIXct(c("2014-03-03 20:19:42", "2014-03-03 21:53:17", "2014-02-21 12:13:06")),
  time2 = as.POSIXct(c("2014-05-28 15:26:49 UTC", NA, "2014-05-24 10:53:01 UTC")),
  time3 = as.POSIXct(c(NA, "2014-09-26 00:52:40 UTC", "2014-09-27 07:08:47 UTC")),
  factor1 = factor(c("A", "B", "C")),
  factor2 = factor(c("B", NA, "C")),
  factor3 = factor(c(NA, "A", "B")),
  integer1 = c(1, 3, 2),
  integer2 = c(1, NA, 4),
  integer3 = c(NA, 5, 2)
)

Desired outcome:

# A tibble: 9 x 5
     id gender Time                Integer Factor
  <dbl> <fct>  <dttm>                <dbl> <fct> 
1     1 Male   2014-03-03 20:19:42       1 A     
2     2 Female 2014-03-03 21:53:17       3 B     
3     3 Female 2014-02-21 12:13:06       2 C     
4     1 Male   2014-05-28 15:26:49       1 B     
5     2 Female NA                       NA NA    
6     3 Female 2014-05-24 10:53:01       4 C     
7     1 Male   NA                       NA NA    
8     2 Female 2014-09-26 00:52:40       5 A     
9     3 Female 2014-09-27 07:08:47       2 B 

P.S. I did find a couple of threads that scratch the surface of gathering multiple variables, but none deal with the issue of gathering different classes and describe the current state of the art Tidyverse solution.

captain
  • 543
  • 1
  • 3
  • 20
  • 1
    [Reshaping multiple sets of measurement columns (wide format) into single columns (long format)](https://stackoverflow.com/questions/12466493/reshaping-multiple-sets-of-measurement-columns-wide-format-into-single-columns) – Henrik Dec 16 '18 at 22:31
  • That's one of the threads that I consulted, but it doesn't fully deal with the problem. E.g. the best solution using tidyr provided in the above thread is `data %>% gather(key = variable, value = value, -id, -gender) %>% mutate(wave = readr::parse_number(variable)) %>% mutate(variable = gsub("\\d","", x = variable)) %>% spread(variable, value)`. Howeverk this does not preserve any of the different classes - it recodes them to character. Also, it issues the warning message "attributes are not identical across measure variables; they will be dropped". – captain Dec 16 '18 at 22:55
  • Adding `convert=T` to the `spread` and `gather` above helps with some of the class issues, however i'm not sure about the date time one. – NColl Dec 16 '18 at 23:01

2 Answers2

0

Probably too repetitive for what you want, but using mutate_at to recode multiple variables at the end when dealing with a large number of variables may be an option

Changing them all to character at the start maintains the time data then it needs to be converted back to date time at the end

 data %>% 
  mutate_all(funs(as.character)) %>%
  gather(key = variable, value = value, -id, -gender, convert = T) %>%
  mutate(wave = readr::parse_number(variable),
         variable = gsub("\\d","", x = variable)) %>% 
  spread(variable, value, convert = T) %>%
  mutate(time = as.POSIXct(time),
         factor = factor(factor),
         gender = factor(gender)) %>%
  select(1, 2, 6, 5, 4)

 # A tibble: 9 x 5
  id    gender time                integer factor
  <chr> <fct>  <dttm>                <int> <fct> 
1 1     Male   2014-03-03 20:19:42       1 A     
2 1     Male   2014-05-28 15:26:49       1 B     
3 1     Male   NA                       NA NA    
4 2     Female 2014-03-03 21:53:17       3 B     
5 2     Female NA                       NA NA    
6 2     Female 2014-09-26 00:52:40       5 A     
7 3     Female 2014-02-21 12:13:06       2 C     
8 3     Female 2014-05-24 10:53:01       4 C     
9 3     Female 2014-09-27 07:08:47       2 B   
NColl
  • 757
  • 5
  • 19
  • Thank you for the suggestion. This does produce the desired output, but I am a bit wary about mutating all variables to character at the beginning and then having to mutate them back to the original class. This does get really exhausting when there are hundreds of columns, potentially loses valuable information, and introduces a lot of room for making errors along the way. – captain Dec 17 '18 at 10:11
0

(I'm rewriting basically all of my previous answer but keeping as this post to preserve comments.)

You can use some of the tidyselect helper functions, namely starts_with, to select batches of columns to gather, and then drop superfluous ones. This handles (some) of the issue of data types with gathering, because you're gathering sets of columns of the same type together, but it still requires re-coercing Factor into a factor because of the different factor levels present when gathering (see the warning message).

What I had trouble grasping was how the gathered columns would "move" while keeping some pattern with the id and gender columns. Doing a series of gather calls doesn't keep the pattern you want, but you can do each gather call and join them back together.

Here's one:

library(tidyverse)

data %>%
  select(id, gender, starts_with("time")) %>%
  gather(key = key_time, value = Time, starts_with("time"))
#> # A tibble: 9 x 4
#>      id gender key_time Time               
#>   <dbl> <fct>  <chr>    <dttm>             
#> 1     1 Male   time1    2014-03-03 20:19:42
#> 2     2 Female time1    2014-03-03 21:53:17
#> 3     3 Female time1    2014-02-21 12:13:06
#> 4     1 Male   time2    2014-05-28 15:26:49
#> 5     2 Female time2    NA                 
#> 6     3 Female time2    2014-05-24 10:53:01
#> 7     1 Male   time3    NA                 
#> 8     2 Female time3    2014-09-26 00:52:40
#> 9     3 Female time3    2014-09-27 07:08:47

To do all of these, you can map over the prefixes—"time," "factor," and "integer"—and reduce-join them together. The trick is that you need some unique identifier for each row in order to join properly; for this, I added a column with row_number, use it as a joining column, then drop it.

map(c("time", "factor", "integer"), function(p) {
  val_name <- str_to_title(p)
  data %>%
    select(id, gender, starts_with(p)) %>%
    gather(key = key, value = !!val_name, starts_with(p)) %>%
    select(-key) %>%
    mutate(row = row_number())
}) %>%
  reduce(left_join) %>%
  select(-row)
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
#> Joining, by = c("id", "gender", "row")
#> Joining, by = c("id", "gender", "row")
#> # A tibble: 9 x 5
#>      id gender Time                Factor Integer
#>   <dbl> <fct>  <dttm>              <chr>    <dbl>
#> 1     1 Male   2014-03-03 20:19:42 A            1
#> 2     2 Female 2014-03-03 21:53:17 B            3
#> 3     3 Female 2014-02-21 12:13:06 C            2
#> 4     1 Male   2014-05-28 15:26:49 B            1
#> 5     2 Female NA                  <NA>        NA
#> 6     3 Female 2014-05-24 10:53:01 C            4
#> 7     1 Male   NA                  <NA>        NA
#> 8     2 Female 2014-09-26 00:52:40 A            5
#> 9     3 Female 2014-09-27 07:08:47 B            2

It's a little ugly, and won't fit well in a piped workflow already underway, but you could easily enough wrap it in a function:

gather_by_prefix <- function(.data, prefix) {
  map(prefix, function(p) {
    val_name <- str_to_title(p)
    data %>%
      select(id, gender, starts_with(p)) %>%
      gather(key = key, value = !!val_name, starts_with(p)) %>%
      select(-key) %>%
      mutate(row = row_number())
  }) %>%
    reduce(left_join) %>%
    select(-row)
}

Calling it like so gets the same output as above:

data %>%
  gather_by_prefix(c("time", "factor", "integer"))

As for keeping factor levels, I think unfortunately you'll need to coerce it back afterwards. There are other questions on possible ways around it; here's one.

It's also worth noting that the tidyr github has several issues filed on work being done to implement a multi_gather-type of function, likely for use cases like yours. Not sure if those would cover factor conversion.

camille
  • 16,432
  • 18
  • 38
  • 60
  • I think the OP wanted Time1 matched with Factor1 and Integer1; only 9 rows instead of 81. Needs one or two more steps to filter out the other rows. – Jon Spring Dec 17 '18 at 01:19
  • Okay, thanks. If that's the case, there's some logic in the question that I'm not seeing. @captain can you explain more clearly how you intend for reshaping to happen? – camille Dec 17 '18 at 01:33
  • Thank you for the suggestion. Indeed, I want to have 9 rows after reshaping the data, like in the example in my question. So each additional gather() introduces lots of additional unwanted rows. – captain Dec 17 '18 at 10:22
  • Regarding you comment on the error message when gathering factors: Even when I replace the NAs in the factor1-3 variables with A and C in order to make the levels identical, and I add `convert = TRUE`, the class gets recoded to character. Is there really no way to make sure that factors get preserved in a gather operation? – captain Dec 17 '18 at 10:26