0

For an excel file with multiple row of headers as such (test data for downloading from here):

enter image description here

How could I skip rows Unit and Frequency and use indicator_name as header of excel file while reading excel with R?

With code below, it seems I could only skip one row by setting skip parameter an integer.

library(readxl)
myCols <- as.character(read_excel("./test123.xlsx", n_max = 1, col_names = FALSE))
myDF <- read_excel("./test123.xlsx", skip = 2, col_names = myCols)

Reference:

Skip rows while use read_excel or read.excel in R

ah bon
  • 9,293
  • 12
  • 65
  • 148

1 Answers1

1

You just need to skip = 3 instead of 2, as you need to skip the header when you read in the data. Since we have already defined column names in myCols, then we do not need to keep the column name row when you read it in.

library(readxl)

myCols <- as.character(read_excel("./test123.xlsx", n_max = 1, col_names = FALSE))
myDF <- read_excel("./test123.xlsx", skip = 3, col_names = myCols)

Output

  indicator_name         M2   GDP
  <dttm>              <dbl> <dbl>
1 2018-01-01 00:00:00  6.71  8.17
2 2018-01-02 00:00:00  6.79  8.19
3 2018-01-03 00:00:00  6.77  8.21
4 2018-01-04 00:00:00  6.73  8.20
5 2018-01-05 00:00:00  6.67  8.20
6 2018-01-06 00:00:00  6.62  8.21
7 2018-01-07 00:00:00  6.62  8.21
8 2018-01-08 00:00:00  6.64  8.22
9 2018-01-09 00:00:00  6.64  8.22

If you have the first column name blank, then you could replace the NA in the column names before reading in the data.

library(tidyverse)

myCols <- read_excel("./test123.xlsx", n_max = 2, col_names = FALSE) %>% 
  slice(1) %>% 
  mutate(across(everything(), ~replace_na(., "indicator_name"))) %>% 
  as.character()
myDF <- read_excel("./test123.xlsx", skip = 3, col_names = myCols)

Benchmark

In this instant, it looks like it would still be faster to just filter out the rows after reading them in.

enter image description here

bm <- microbenchmark::microbenchmark(filter_before = {myCols <- as.character(read_excel("./test123.xlsx", n_max = 1, col_names = FALSE));
myDF <- read_excel("./test123.xlsx", skip = 3, col_names = myCols)},
filter_after = {myDF2 <- read_excel("./test123.xlsx");
myDF2 <- myDF2[-c(1:2),]},
times = 1000)
autoplot(bm)
AndrewGB
  • 16,126
  • 5
  • 18
  • 49
  • It works, thanks for your help again, btw I found that when `indicator_name` cell is filled with `NA`, then I'll get `Error: Sheet 1 has 176 columns (176 unskipped), but col_names has length 175`. – ah bon Mar 29 '22 at 06:11
  • 1
    @ahbon That seems like a pretty common scenario with excel files. I just added an additional part that you could fix the `NA` in the names, before reading in the excel so that you have the correct number of column names. I also added a benchmark as I was curious if it was quicker to just filter these rows after the fact. – AndrewGB Mar 29 '22 at 06:21
  • 1
    Many thanks for your very detailed solutions and explanations. – ah bon Mar 29 '22 at 06:57