2

I have the following stylized wide data frame, df_wide, that I would like to reshape long in R:

df_wide = data.frame(country = c("A", "B"),
                gdp_1999 = c(100, 200),
                gdp_2000 = c(400,500),
                poverty_rate_1999 = c(35,40),
                poverty_rate_2000 = c(10,15),
                inequality_score_gini_1999 = c(20,25),
                inequality_score_gini_2000 = c(40,45)
                )
df_wide
  country gdp_1999 gdp_2000 poverty_rate_1999 poverty_rate_2000 inequality_score_gini_1999 inequality_score_gini_2000
1       A      100      400                35                10                         20                         40
2       B      200      500                40                15                         25                         45

I would like to use pivot_longer to reshape the data long by country-year, such that the data frame looks like this:

df_long = data.frame(country = c("A","A","B","B"),
                     year = c(1999,1999,2000,2000),
                     gdp = c(100,400,200,500),
                     poverty_rate = c(35,10,40,15),
                     inequality_score_gini = c(20,40,25,45))
df_long
country year gdp poverty_rate inequality_score_gini
1       A 1999 100           35                    20
2       A 1999 400           10                    40
3       B 2000 200           40                    25
4       B 2000 500           15                    45

How would I do that with pivot_longer? Kindly note that I have specifically picked variables with different numbers of underscores, because that is where related past posts don't provide adequate guidance for my dataset. The best I can do is with regular reshape, reshaping one stub pattern at a time, such as:

library(reshape)
long_data_gdp <- reshape(df_wide, 
                     varying = c("gdp_1999", "gdp_2000"),
                     idvar = "country",
                     direction = "long",
                     sep = "_")

I would really appreciate if the correct answer could provide two sub-answers: (1) an answer with starts_with() so everything could be done by stub names instead of names_pattern, because my attempt above with reshape fails when there is more than one underscore separator (i.e., poverty_rate and inequality_gini_score don't work); (2) an answer with names_pattern but that explains how to target the last underscore as a separator in a way that is easy for non-regex experts to understand.

Mike
  • 197
  • 1
  • 9

1 Answers1

4

This answers the names_pattern approach:

The trick is to use a regex which sets up the two groups: (1) all value variables (.*) and (2) the year variable ([0-9]{4}). We need to use brackets to specify the groups.

library(tidyverse)

df_wide %>% 
  pivot_longer(cols = !country,
               names_to = c(".value", "year"),
               names_pattern = "^(.*)_([0-9]{4})$")

#> # A tibble: 4 × 5
#>   country year    gdp poverty_rate inequality_score_gini
#>   <chr>   <chr> <dbl>        <dbl>                 <dbl>
#> 1 A       1999    100           35                    20
#> 2 A       2000    400           10                    40
#> 3 B       1999    200           40                    25
#> 4 B       2000    500           15                    45

Created on 2022-04-30 by the reprex package (v0.3.0)

Below the data:

df_wide <- tribble(~country, ~gdp_1999, ~gdp_2000, ~poverty_rate_1999, ~poverty_rate_2000, ~inequality_score_gini_1999, ~inequality_score_gini_2000,
                   "A",      100,      400,                35,                10,                         20,                         40,
                   "B",      200,      500,                40,                15,                         25,                         45)
TimTeaFan
  • 17,549
  • 4
  • 18
  • 39
  • Thank you, @TimTeaFan! The solution clearly works and is very clever. A couple questions: 1) any tips on how to easily learn `regex` so I can think for myself next time? The `names_pattern` solutions always confuse me. 2) Is there any way to do this with `starts_with` or by manually specifying the stub names before the year at the end--kind of like what I started with `reshape` but able to handle variables with multiple underscores? – Mike Apr 30 '22 at 20:53
  • 1
    Regarding (1): I like playing around with [regex101](https://regex101.com). Further [this GitHub repo](https://github.com/ziishaned/learn-regex) is short and helpful. – TimTeaFan Apr 30 '22 at 21:34
  • 1
    Regarding (2) and the general approach with `starts_with`: I'm not sure if I understand how your desired approach should look like. We can use `starts_with` and other tidy select syntax in the `col` argument. We can also use regex here with `matches()`. Can you specify how a answer with `pivot_longer` should look like (even if the code might not be working)? – TimTeaFan Apr 30 '22 at 21:41
  • basically, instead of using the `".value"`, you would use something like `starts_with("gdp", "poverty_rate", "inequality_score_gini")` . Basically, what I tried to do with `reshape` above but failed because of the multiple underscores. – Mike Apr 30 '22 at 21:47