1

One of the most tidyous exercises in data manipulation with R is reading in data from sources that were not compiled for that purpose, and I always struggle when I face such problems. I would welcome any help and have thus compiled an example that illustrates some of the problems that I often face.

Suppose I have the following Excel-file: enter image description here

and want to read it to a long data.frame that should look like this:

   year country  region    group count
1  2015 Austria capital students  4747
2  2015 Austria capital toddlers  1781
3  2015 Austria capital  workers  1443
4  2015 Austria  cities students  3245
5  2015 Austria  cities toddlers   404
6  2015 Austria  cities  workers   213
7  2015 Austria   total students  7992
8  2015 Austria   total toddlers  2185
9  2015 Austria   total  workers  1656
10 2015 Denmark capital students   289
11 2015 Denmark capital toddlers  3699
12 2015 Denmark capital  workers  1518
13 2015 Denmark  cities students  4659
14 2015 Denmark  cities toddlers  2476
15 2015 Denmark  cities  workers  2495
16 2015 Denmark   total students  4948
17 2015 Denmark   total toddlers  6175
18 2015 Denmark   total  workers  4013
19 2016 Austria capital students  1836
20 2016 Austria capital toddlers  1130
21 2016 Austria capital  workers  1981
22 2016 Austria  cities students   809
23 2016 Austria  cities toddlers  2126
24 2016 Austria  cities  workers  2267
25 2016 Austria   total students  2645
26 2016 Austria   total toddlers  3256
27 2016 Austria   total  workers  4248
28 2016 Denmark capital students  2251
29 2016 Denmark capital toddlers  2555
30 2016 Denmark capital  workers  1829
31 2016 Denmark  cities students  4722
32 2016 Denmark  cities toddlers  2165
33 2016 Denmark  cities  workers  4373
34 2016 Denmark   total students  6973
35 2016 Denmark   total toddlers  4720
36 2016 Denmark   total  workers  6202
37 2017 Austria capital students  1181
38 2017 Austria capital toddlers   710
39 2017 Austria capital  workers  3876
40 2017 Austria  cities students   895
41 2017 Austria  cities toddlers   994
42 2017 Austria  cities  workers  3199
43 2017 Austria   total students  2076
44 2017 Austria   total toddlers  1704
45 2017 Austria   total  workers  7075
46 2017 Denmark capital students  1155
47 2017 Denmark capital toddlers  4455
48 2017 Denmark capital  workers  3292
49 2017 Denmark  cities students   683
50 2017 Denmark  cities toddlers  3565
51 2017 Denmark  cities  workers   561
52 2017 Denmark   total students  1838
53 2017 Denmark   total toddlers  8020
54 2017 Denmark   total  workers  3853

The main challenges are:

  • The "header" spans several lines. It could be possible to read in the header separately as described here but that does not offer a simple solution as to how to attach it to back the data if it is not meant to be pasted to a single value – the same applies to the column names.

  • There are many linked cells that will be NA or empty if read via open.xlsx. That problem can be addressed using tidyr::fill, but that requires having an appropriate data structure for the headers first.

  • Some headers are relevant categories while others are not ("non-workers" are redundant)

  • Some headers are not explicitly stated in the source and must be added manually ("total","region","group").

The alternative way to import this data is to read.xlsx only columns 3:11 and rows 4:12, use reshape to transform to long and add the other variables manually, i.e. specify some rep() cascades and hope to specify correct permutations in order to label the fields correctly:

library(openxlsx)
library(reshape2)
library(dpylr)

t <- read.xlsx("h:/example.xlsx", cols=3:11, rows=4:12, colNames=FALSE) %>%

  melt %>%

  transmute(count = value) %>%

  mutate(country = c("Austria","Denmark") %>% rep(each=3) %>% rep(times=9),
         region = c("total","capital","cities") %>% rep(times=54/3),
         year = c(2015:2017) %>% rep(each=6*3),
         group = c("workers","students","toddlers") %>% rep(each=6) %>% rep(times=3))

Is there an elegant way to read this type of data automatically into R?

mzuba
  • 1,226
  • 1
  • 16
  • 33
  • Probably yes but can you make an example dataset please (or a link to where one can be downloaded directly?). Without it you probably won't get any useful responses. – Remko Duursma Sep 14 '17 at 04:04
  • Also, you haven't stated whether your 'alternative way' actually works or not (and if it doesn't, why not?) – Remko Duursma Sep 14 '17 at 04:05
  • Thank you for your comments. The example dataset is linked above (the link text reads "Excel-file", it is placed directly above the screenshot of the dataset). The 'alternative way' works in the minimal dataset but I it requires manual specification of all headers, which is what I want to avoid. – mzuba Sep 14 '17 at 07:53

0 Answers0