One of the most tidyous exercises in data manipulation with R
is reading in data from sources that were not compiled for that purpose, and I always struggle when I face such problems. I would welcome any help and have thus compiled an example that illustrates some of the problems that I often face.
Suppose I have the following Excel-file:
and want to read it to a long data.frame that should look like this:
year country region group count
1 2015 Austria capital students 4747
2 2015 Austria capital toddlers 1781
3 2015 Austria capital workers 1443
4 2015 Austria cities students 3245
5 2015 Austria cities toddlers 404
6 2015 Austria cities workers 213
7 2015 Austria total students 7992
8 2015 Austria total toddlers 2185
9 2015 Austria total workers 1656
10 2015 Denmark capital students 289
11 2015 Denmark capital toddlers 3699
12 2015 Denmark capital workers 1518
13 2015 Denmark cities students 4659
14 2015 Denmark cities toddlers 2476
15 2015 Denmark cities workers 2495
16 2015 Denmark total students 4948
17 2015 Denmark total toddlers 6175
18 2015 Denmark total workers 4013
19 2016 Austria capital students 1836
20 2016 Austria capital toddlers 1130
21 2016 Austria capital workers 1981
22 2016 Austria cities students 809
23 2016 Austria cities toddlers 2126
24 2016 Austria cities workers 2267
25 2016 Austria total students 2645
26 2016 Austria total toddlers 3256
27 2016 Austria total workers 4248
28 2016 Denmark capital students 2251
29 2016 Denmark capital toddlers 2555
30 2016 Denmark capital workers 1829
31 2016 Denmark cities students 4722
32 2016 Denmark cities toddlers 2165
33 2016 Denmark cities workers 4373
34 2016 Denmark total students 6973
35 2016 Denmark total toddlers 4720
36 2016 Denmark total workers 6202
37 2017 Austria capital students 1181
38 2017 Austria capital toddlers 710
39 2017 Austria capital workers 3876
40 2017 Austria cities students 895
41 2017 Austria cities toddlers 994
42 2017 Austria cities workers 3199
43 2017 Austria total students 2076
44 2017 Austria total toddlers 1704
45 2017 Austria total workers 7075
46 2017 Denmark capital students 1155
47 2017 Denmark capital toddlers 4455
48 2017 Denmark capital workers 3292
49 2017 Denmark cities students 683
50 2017 Denmark cities toddlers 3565
51 2017 Denmark cities workers 561
52 2017 Denmark total students 1838
53 2017 Denmark total toddlers 8020
54 2017 Denmark total workers 3853
The main challenges are:
The "header" spans several lines. It could be possible to read in the header separately as described here but that does not offer a simple solution as to how to attach it to back the data if it is not meant to be pasted to a single value – the same applies to the column names.
There are many linked cells that will be
NA
or empty if read viaopen.xlsx
. That problem can be addressed usingtidyr::fill
, but that requires having an appropriate data structure for the headers first.Some headers are relevant categories while others are not ("non-workers" are redundant)
Some headers are not explicitly stated in the source and must be added manually ("total","region","group").
The alternative way to import this data is to read.xlsx only columns 3:11 and rows 4:12, use reshape to transform to long and add the other variables manually, i.e. specify some rep()
cascades and hope to specify correct permutations in order to label the fields correctly:
library(openxlsx)
library(reshape2)
library(dpylr)
t <- read.xlsx("h:/example.xlsx", cols=3:11, rows=4:12, colNames=FALSE) %>%
melt %>%
transmute(count = value) %>%
mutate(country = c("Austria","Denmark") %>% rep(each=3) %>% rep(times=9),
region = c("total","capital","cities") %>% rep(times=54/3),
year = c(2015:2017) %>% rep(each=6*3),
group = c("workers","students","toddlers") %>% rep(each=6) %>% rep(times=3))
Is there an elegant way to read this type of data automatically into R?