0

I currently work with multiple large datasets of the same row number but different column numbers. Now I need to calculate the rate of change between columns and add it to either a new object or to the existing object to go on with my analysis.

In my research on the web I usually only encounterd people trying to figure out rate of change in a column but not between those. Is the easiest way to just flip all my data?

I am very sorry for my vague description of my problem as R and english are not my first languages.

I hope you can still show me the direction to further my understanding of R.

Thank you in advance for any tipps you might have!

same11
  • 15
  • 6
  • I think you are looking for [this](https://stackoverflow.com/questions/44533527/applying-percentage-change-between-two-columns-same-row) – Vishal A. Dec 16 '21 at 12:33

1 Answers1

0

I recommend joining all the data together and then convert it into a 3NF normalized long format table:

library(tidyverse)

data1 <- tibble(
  country = c("A", "B", "C"),
  gdp_2020 = c(1, 8, 10),
  gdp_2021 = c(1, 8, 10),
  population_2010 = c(5e3, 6e3, 6e3),
  population_2020 = c(5.5e3, 6.8e3, 6e3)
)
data1
#> # A tibble: 3 x 5
#>   country gdp_2020 gdp_2021 population_2010 population_2020
#>   <chr>      <dbl>    <dbl>           <dbl>           <dbl>
#> 1 A              1        1            5000            5500
#> 2 B              8        8            6000            6800
#> 3 C             10       10            6000            6000

data2 <- tibble(
  country = c("A", "B", "C"),
  population_2021 = c(7e3, 8e3, 7e3),
  population_2022 = c(7e3, 7e3, 10e3)
)
data2
#> # A tibble: 3 x 3
#>   country population_2021 population_2022
#>   <chr>             <dbl>           <dbl>
#> 1 A                  7000            7000
#> 2 B                  8000            7000
#> 3 C                  7000           10000

list(
  data1,
  data2
) %>%
  reduce(full_join) %>%
  pivot_longer(matches("^(gdp|population)")) %>%
  separate(name, into = c("variable", "year"), sep = "_") %>%
  type_convert() %>%
  arrange(country, variable, year) %>%
  group_by(variable, country) %>%
  mutate(
    # NA for the first value because it does not have a precursor to calculate change
    change_rate = (value - lag(value)) / (year - lag(year))
  )
#> Joining, by = "country"
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   country = col_character(),
#>   variable = col_character(),
#>   year = col_double()
#> )
#> # A tibble: 18 x 5
#> # Groups:   variable, country [6]
#>    country variable    year value change_rate
#>    <chr>   <chr>      <dbl> <dbl>       <dbl>
#>  1 A       gdp         2020     1          NA
#>  2 A       gdp         2021     1           0
#>  3 A       population  2010  5000          NA
#>  4 A       population  2020  5500          50
#>  5 A       population  2021  7000        1500
#>  6 A       population  2022  7000           0
#>  7 B       gdp         2020     8          NA
#>  8 B       gdp         2021     8           0
#>  9 B       population  2010  6000          NA
#> 10 B       population  2020  6800          80
#> 11 B       population  2021  8000        1200
#> 12 B       population  2022  7000       -1000
#> 13 C       gdp         2020    10          NA
#> 14 C       gdp         2021    10           0
#> 15 C       population  2010  6000          NA
#> 16 C       population  2020  6000           0
#> 17 C       population  2021  7000        1000
#> 18 C       population  2022 10000        3000

Created on 2021-12-16 by the reprex package (v2.0.1)

Example: rate of change in the second row (gdp of country A) is 0 because it was the same in both 2021 and 2020.

danlooo
  • 10,067
  • 2
  • 8
  • 22
  • I'd want to calculate the roc between id and foo and add a new column with that. But imagine having 20 columns so I'd like to add 19 new columns with the rate of change from col1 to col2, col2 to col3 etc. – same11 Dec 16 '21 at 12:42
  • What does rate of change between `id` and `foo` mean? In my example, change of `foo` is the difference between `foo` in the current row and the previous row. – danlooo Dec 16 '21 at 12:44
  • So my data is structured like this: My rows are countries and my col's are variables in a given year. Sometimes I have data for a range of 20 years for each country sometimes it's only 3 years. I have around 20 of those different data sets. I need a way to get the rate of change between the years for each country. So the rate of change between "id" to "foo" which would be gross income 2019 to 2020 in bulgaria in my data. – same11 Dec 16 '21 at 12:48
  • Do you have one excel worksheet or file per year? How is time encoded? – danlooo Dec 16 '21 at 12:50
  • I have one excel sheet with around 500k entry points. It has one row for each country and around 100 col's I already split them up into variable groups in R so that i have objects "only containing one variable" over a given time. Those are my multiple objects I need to calculate the roc for. The time is only encoded in the col name itself as a year. – same11 Dec 16 '21 at 12:53
  • So you have columns e.g. country, gdp_2021, gdp_2020, inhabitants_2020, inhabitants_2021 etc? Do we have always the same delimiter (_ in my case)? – danlooo Dec 16 '21 at 12:56
  • You could think of it like that, yes! Its all in one very big excel sheet tho just stiched together. – same11 Dec 16 '21 at 12:57
  • @same11 I revised my answer – danlooo Dec 16 '21 at 13:09