0

I am trying to transform two sequences columns of my dataset from wide to long. I tried following this thread, but have not been able to follow it : Using pivot_longer with 2 groups of columns

Here is an example data set (sorry - not that great at generating sample data yet):

df <- data.frame(id = c(1,2,3,4,5,6,7),
                 wt = c(12,14,16,17,17,15,14),
                 gen = c("m", "f","m", "f","m","m", "f"),
                 start_date_1 = c("2022-04-03" ,"2022-04-03", "2022-04-03", "2022-04-03", "2022-04-03", "2022-04-03", "2022-04-03"),
                 start_date_2 = c("2022-04-03" ,"2022-04-03", "2022-04-03", "2022-04-03", "2022-04-03", "2022-04-03", "2022-04-03"),
                 start_date_3 = c("2022-04-03" ,"2022-04-03", "2022-04-03", "2022-04-03", "2022-04-03", "2022-04-03", "2022-04-03"),
                 end_date_1 = c("2022-04-03" ,"2022-04-03", "2022-04-03", "2022-04-03", "2022-04-03", "2022-04-03", "2022-04-03"),
                 end_date_2 = c("2022-04-03" ,"2022-04-03", "2022-04-03", "2022-04-03", "2022-04-03", "2022-04-03", "2022-04-03"),
                 end_date_3 = c("2022-04-03" ,"2022-04-03", "2022-04-03", "2022-04-03", "2022-04-03", "2022-04-03", "2022-04-03")
)

and here is the code I am trying:

test_pivot<- df %>%
  pivot_longer(cols = everything(),
               names_to = c(".value", "end_date"),
               names_pattern = "([a-z]+_)(\\d)")

The output I am hoping to get is:

ID  start_date     start_date_date   end_date     end_date_date
1   start_date_1   2022-04-03        end_date_1    2022-05-07
1   start_date_2   2022-04-08        end_date_2    2022-05-06
1   start_date_3   2022-04-22        end_date_3    2022-05-17
1   start_date_4   2022-04-01        end_date_4    2022-05-18

Within this dataset there are 100s of columns and 1000s of unique IDs. In the example dataset, the columns are consecutive, but in the actual dataset the columns could come in any order.

Thanks of your help

Phil
  • 7,287
  • 3
  • 36
  • 66
Cae.rich
  • 171
  • 7
  • I fixed the input data – Cae.rich Sep 29 '22 at 16:58
  • If you don't put the quotes on the date columns, it will return different vvalues – akrun Sep 29 '22 at 17:03
  • I guess you don't need the `start_date/end_date` columns with 1, 2, 3, 4, as it is same value. `df %>% select(-c(wt, gen)) %>% pivot_longer(cols = contains('date'), names_to = c(".value", "date"), names_pattern = "^(\\w+_\\w+)_(\\d+)$")` – akrun Sep 29 '22 at 17:08
  • Thanks of your input. In this case I do need the start_date/end_date even if they are the same value – Cae.rich Sep 29 '22 at 17:11

1 Answers1

1

Does this work?

out <- pivot_longer(df, -c(id, wt, gen), names_pattern = "(.*)_([0-9]+)",
                    names_to = c(".value", "num"))
out
# # A tibble: 21 x 6
#       id    wt gen   num   start_date end_date  
#    <dbl> <dbl> <chr> <chr> <chr>      <chr>     
#  1     1    12 m     1     2022-04-03 2022-04-03
#  2     1    12 m     2     2022-04-03 2022-04-03
#  3     1    12 m     3     2022-04-03 2022-04-03
#  4     2    14 f     1     2022-04-03 2022-04-03
#  5     2    14 f     2     2022-04-03 2022-04-03
#  6     2    14 f     3     2022-04-03 2022-04-03
#  7     3    16 m     1     2022-04-03 2022-04-03
#  8     3    16 m     2     2022-04-03 2022-04-03
#  9     3    16 m     3     2022-04-03 2022-04-03
# 10     4    17 f     1     2022-04-03 2022-04-03
# # ... with 11 more rows

If you don't need wt, or gen, then you can de-select them. The num field is a proxy for your start_date and end_date fields: since they are perfectly correlated (due to the trailing number on the original _date_ columns).

To perfectly reproduce your results with the aesthetics, then

out %>%
  select(-wt, -gen) %>%
  rename(start_date_date = start_date, end_date_date = end_date) %>%
  mutate(start_date = paste0("start_date_", num), end_date = paste0("end_date_", num)) %>%
  select(id, start_date, start_date_date, end_date, end_date_date)
# # A tibble: 21 x 5
#       id start_date   start_date_date end_date   end_date_date
#    <dbl> <chr>        <chr>           <chr>      <chr>        
#  1     1 start_date_1 2022-04-03      end_date_1 2022-04-03   
#  2     1 start_date_2 2022-04-03      end_date_2 2022-04-03   
#  3     1 start_date_3 2022-04-03      end_date_3 2022-04-03   
#  4     2 start_date_1 2022-04-03      end_date_1 2022-04-03   
#  5     2 start_date_2 2022-04-03      end_date_2 2022-04-03   
#  6     2 start_date_3 2022-04-03      end_date_3 2022-04-03   
#  7     3 start_date_1 2022-04-03      end_date_1 2022-04-03   
#  8     3 start_date_2 2022-04-03      end_date_2 2022-04-03   
#  9     3 start_date_3 2022-04-03      end_date_3 2022-04-03   
# 10     4 start_date_1 2022-04-03      end_date_1 2022-04-03   
# # ... with 11 more rows
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thanks for your response. I tried the code and get this error message : Error in `as_indices_impl()`: ! Must subset columns with a valid subscript vector. ✖ Subscript has the wrong type `tbl_df< Original Tag : character Calving Date 1 : datetime I think it has to do with how my dates are formatted. Is there an easy was I can re format them either before of after this code? – Cae.rich Sep 30 '22 at 08:28
  • I seem to have figured out the num/character problem.... but not I get this : Error in `vec_cbind()`: ! Can't recycle `..1` (size 2940) to match `..3` (size 0). – Cae.rich Sep 30 '22 at 08:45
  • 1
    Got it! It worked. In my actual dataset there was no _ in the column name so I removed the _ in the names_pattern = "(.*)_([0-9]+)" – Cae.rich Sep 30 '22 at 08:48