
I have been given an Excel spreadsheet: column names are in the first row, garbage text is in the second row, and the actual data begins in the third row. I want to use the readxl package to read this into a data frame, keeping the column names from the first row but discarding the second row.

Simply reading all the rows into a data frame and then deleting the first row won't work, because the garbage in the second row of the Excel file won't match the data type of the column.

I'd like a way to do this without manually editing the excel file.

Matthew
    You'll need to read in the whole file (which will give you the column names), and then manually remove row 2, and potentially rearrange column entries in the rows below row 2 to match column names. I found `read_excel` quite robust when it comes to "nonsense" lines: it will still read the file, but any potential cleaning is up to you. – Maurits Evers Aug 03 '18 at 13:18

2 Answers


I would suggest reading the whole file, and then manually removing row 2.

As an example, here is a screenshot of a sample Excel file:


We read the complete file, remove row 1 of the result (which corresponds to the second row of the Excel sheet), and then re-guess the column types:

library(readxl)
library(tidyverse)

# Read the whole sheet, drop the first data row (the garbage row),
# then re-guess each column's type with readr::parse_guess
df <- read_excel("Workbook1.xlsx")[-1, ] %>%
    map_df(~parse_guess(.))
df
# A tibble: 2 x 4
#      A     B     C     D
#  <int> <int> <int> <int>
#1    20    30    40    50
#2    30    40    50    60
Maurits Evers
  • Since OP mentioned manually removing the offending row messes up his data types, a logical next step would be: `df %>% mutate_if(is.character, as.numeric)`. Of course this assumes there are no genuine character columns, but the `mutate` call can also be changed to address only those columns that need converting – tifu Aug 03 '18 at 13:35
  • Alternatively, after removing the second row, reparse all columns, e.g. using `purrr::map` and `readr::parse_guess`: `df %>% map_df(~parse_guess(.))`. `%>%` is from magrittr; all in the `tidyverse`. – Jannik Buhr Aug 03 '18 at 13:54
  • @JannikBuhr Oh I like that. Thanks and added. – Maurits Evers Aug 03 '18 at 14:37
  • Appears to be [broken due to a regression](https://github.com/tidyverse/readr/issues/1058). This should work for all column types and handle missing values: `df <- read_excel(path = "Workbook1.xlsx", col_types = 'text') %>% slice(2:n()) %>% type_convert(na = c("", "NA"))` – Paul Dec 03 '21 at 15:11
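For readers who hit the `parse_guess` regression mentioned in the last comment, the read-as-text-then-reconvert route can be spelled out as a sketch — assuming the same `Workbook1.xlsx` and that empty cells and `"NA"` are the only missing-value markers:

```r
library(readxl)
library(dplyr)
library(readr)

# Force every column to text so the garbage row can't distort type guessing,
# drop that row, then let readr::type_convert re-infer the real types.
df <- read_excel("Workbook1.xlsx", col_types = "text") %>%
  slice(-1) %>%
  type_convert(na = c("", "NA"))
```

Reading everything as text up front means readxl never tries to reconcile the garbage row with the real data; `type_convert` only touches character columns, so genuine text columns pass through untouched.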

Here's another solution:

First, read in just the first row using readxl and save it as an array (since only the first row is imported, this is fast):

col_names <- array(read_excel('C:/spreadsheet.xlsx', sheet = 'Sheet1', n_max = 1, col_names = FALSE))

Second, read in the same spreadsheet but start at your data:

df <- data.frame(read_excel('C:/spreadsheet.xlsx', sheet = 'Sheet1', skip = 2, col_names = FALSE))

Finally, rename the dataframe columns using the names from the first step:

colnames(df) <- col_names
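The two reads can also be collapsed by handing the saved names straight to `read_excel`'s `col_names` argument, which accepts a character vector — a sketch, where the path and sheet name are placeholders:

```r
library(readxl)

path <- "C:/spreadsheet.xlsx"  # hypothetical path

# Grab just the header row: with n_max = 0 no data rows are read,
# so only the column names come back.
header <- names(read_excel(path, sheet = "Sheet1", n_max = 0))

# Skip the header and garbage rows, supplying the saved names directly;
# readxl then guesses each column's type from the real data only.
df <- read_excel(path, sheet = "Sheet1", skip = 2, col_names = header)
```

Because the skipped read never sees the garbage row, the column types are guessed from the actual data, so no separate rename or reconversion step is needed.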
JamesR