-1

I have a problem, that I can't seem to solve, despite having tried using tidyverse functions pivot_longer and pivot_widerm but I cannot seem to extract the year variable stored in a row below variable names.

Here is a screenshot from the data, the working example has translated variable names

The image above shows a screenshot from the data, the working example is identical, except that variable names have been translated into English.

Here is a minimal working example

library(tidyverse)
library(janitor)

file <- read_delim("C:/Users/Probst/Downloads/testdata.csv", 
                   delim = ";", 
                   escape_double = FALSE, 
                   trim_ws = TRUE,
                   col_types = list(col_double()), 
                   locale = locale(decimal_mark = ",",
                                   grouping_mark = ".")) |>
  clean_names() |> 
  select(-raumeinheit, # not interesting
         -aggregat)    # not interesting

The raw data (testdata.csv) is as follows 

spatialid;Raumeinheit;Aggregat;pop_male;pop_male;pop_male;avg_pop_age; avg_pop_age;avg_pop_age
;;;2017;2018;2019;2017;2018;2019
"01001";"Flensburg, Stadt";"kreisfreie Stadt";"44.086";"44.599";"44.904";"42,17";"42,03";"42,00"
"01002";"Kiel, Stadt";"kreisfreie Stadt";"120.809";"120.566";"120.198";"41,53";"41,59";"41,72"
"02000";"Hamburg, Stadt";"kreisfreie Stadt";"897.207";"902.048";"903.974";"41,67";"41,67";"41,66"

(read_delim(file=I(...) unfortunately did not work for a cleaner working example.)

You will get the same file, with the following code:

file2 <- tribble(~spatialid, ~pop_male_4, ~pop_male_5, ~pop_mal_6, ~avg_pop_age_7, ~avg_pop_age_8, ~avg_pop_age_9,
                NA,     2017, 2018, 2019, 2017, 2018, 2019,
                1001,   44086,  44599,  44904,  42.2, 42.0,    42,
                1002,   120809, 120566, 120198, 41.5, 41.6,  41.7,
                2000,   897207, 902048, 903974, 41.7, 41.7,  41.7)

pop_male is the male population for each spatial unit and avg_pop_age is the average age for each unit. The problem with the dataset is, that the year variable is stored in the row below the variable names, which I cannot seem to extract.

What I would like to end up with a "tidy" data frame, which would look like this:

# rouding errors
desired_result <- 
tribble(~spatial_id, ~year, ~pop_male, ~avg_pop_age,
        1001,       2017,  44086,      42.2,
        1001,       2018,  44599,      42.0,
        1001,       2019,  44904,      42,
        1002,       2017,  120809,     41.5, 
        1002,       2018,  120566,     41.6,
        1002,       2019,  120198,     41.7,
        2000,       2017,  897207,     41.7, 
        2000,       2018,  902048,     41.7,
        2000,       2019,  903974,     41.7)

Any help or hint is highly appreciated.

minimouse
  • 131
  • 10
  • The issue seems to be with merged cells - does this help? https://stackoverflow.com/questions/37509886/how-to-read-merged-excel-cells-with-r – Andrea M May 04 '22 at 12:03
  • Unfortunately no, since the data comes as csv (";"-delimited or as an xls file) and the openxlsx package only is able to deal with xlsx files, unfortunately. – minimouse May 04 '22 at 12:24
  • Your screenshot confused me - that's not a csv, right? csv files can't have formatting and merged cells – Andrea M May 04 '22 at 14:34

1 Answers1

1

For the first part of the question, it's probably easier to skip the first row and then rename the columns manually with colnames():

file <- read.delim("C:/Users/Probst/Downloads/testdata.csv",
           delim = ";",
           skip = 1)

colnames(file) <-  c("spatialid",
                     "Raumeinheit",
                     "Aggregat",
                     "pop_male.2017",
                     "pop_male.2018",
                     "pop_male.2019",
                     "avg_pop_age.2017",
                     "avg_pop_age.2018",
                     "avg_pop_age.2019")

If you use a dot to separate the measure from the year in the column names, you can then use this code. The names_pattern = argument contains a simple regular expression that says that it should split the column names into two: (.*) is the first and the second part (it means: any character in any number), and \\. means: separated by a literal dot. You can then pivot it wider so that you can reach your intended output.

file |>
  pivot_longer(cols = pop_male.2017:avg_pop_age.2019,
               names_to = c("measure", "year"),
               names_pattern = "(.*)\\.(.*)") |>
  pivot_wider(names_from = measure,
              values_from = value)

Output:

#  spatialid Raumeinheit      Aggregat         year  pop_male avg_pop_age
#  <chr>     <chr>            <chr>            <chr> <chr>    <chr>      
#1 01001     Flensburg, Stadt kreisfreie Stadt 2017  44.086   42,17      
#2 01001     Flensburg, Stadt kreisfreie Stadt 2018  44.599   42,03      
#3 01001     Flensburg, Stadt kreisfreie Stadt 2019  44.904   42,00      
#4 01002     Kiel, Stadt      kreisfreie Stadt 2017  120.809  41,53      
#5 01002     Kiel, Stadt      kreisfreie Stadt 2018  120.566  41,59      
#6 01002     Kiel, Stadt      kreisfreie Stadt 2019  120.198  41,72      
#7 02000     Hamburg, Stadt   kreisfreie Stadt 2017  897.207  41,67      
#8 02000     Hamburg, Stadt   kreisfreie Stadt 2018  902.048  41,67      
#9 02000     Hamburg, Stadt   kreisfreie Stadt 2019  903.974  41,66
Andrea M
  • 2,314
  • 1
  • 9
  • 27
  • Thank you very much. Your solution is very helpful and nicely written. Could you hint me at a solution that doesn't require dropping or skipping the first row ( and hence renaming columns manually)? The working example only contains two variables, but the "real" data does come with a couple of hundred variables which would make remaining by hand a time consuming (and error-prone) task. Thank you once again ! – minimouse May 04 '22 at 14:59