How to break out an Excel file with multiple tables in one sheet into separate data frames?

Question

I'm importing an Excel file where a single sheet has multiple tables.

When importing into R using read_excel(), I noticed each table can be delimited from the rest by looking at col1: if it contains 2-4 rows of NAs followed by a row with a header, then what follows is a new table that I'd like to store as a data frame.

For example, the pattern below repeats across the file, each a separate table:

col1         col2       col3       col4
NA           header2    header3    header4  <-- use these headers and
NA           header2a   header3a   header4a <-- append these to the above
header       Number     Number     Number   <-- omit this row
attribute1   Number     Number     Number
attribute2   Number     Number     Number
attribute3   Number     Number     Number
NA           NA         NA         NA
NA           NA         NA         NA

From the above, I would like to pull everything below the headers, and omit NA only rows (which indicate the end of that table). Sample output:

whatever     header2_header2a   header3_header3a   header4_header4a
attribute1   Number             Number             Number
attribute2   Number             Number             Number
attribute3   Number             Number             Number

Finally, I'd like to apply that to the entire file, so that each table gets stored as its own data frame.

Is this even possible?

Gregor Thomas · Answer 1 · 2019-08-20T03:06:31.323

Something like this should work:

# Find the all-NA rows, split the data, and get rid of the all-NA rows
breaks = rowSums(is.na(d)) == ncol(d)
dlist = split(d, f = data.table::rleid(breaks))
dlist = Filter(function(x) !all(is.na(x)), dlist)

# clean the remaining data
dlist = lapply(dlist, function(x) {
  nm = sapply(x[1:2, -1], paste, collapse = "_")
  x = x[-(1:2),]
  names(x) = c("whatever", nm)
  x
})
dlist
# $`1`
#     whatever header2_header2a header3_header3a header4_header4a
# 3     header           Number           Number           Number
# 4 attribute1           Number           Number           Number
# 5 attribute2           Number           Number           Number
# 6 attribute3           Number           Number           Number
# 
# $`3`
#       whatever header2b_header2bb header3b_header3bb header4b_header4bb
# 11      header             Number             Number             Number
# 12 attribute1b             Number             Number             Number
# 13 attribute2b             Number             Number             Number
# 14 attribute3b             Number             Number             Number

You may want to do some sort type conversion to coerce the columns to better classes, but since your example data just has "number" I didn't bother.

Using this input data:

d = read.table(text = 'col1         col2       col3       col4
NA           header2    header3    header4
NA           header2a   header3a   header4a
header       Number     Number     Number
attribute1   Number     Number     Number
attribute2   Number     Number     Number
attribute3   Number     Number     Number
NA           NA         NA         NA
NA           NA         NA         NA
NA           header2b    header3b    header4b
NA           header2bb   header3bb   header4bb
header       Number     Number     Number
attribute1b   Number     Number     Number
attribute2b   Number     Number     Number
attribute3b   Number     Number     Number
NA           NA         NA         NA
NA           NA         NA         NA', header = T)

Hey @Gregor, thanks for this. I follow the logic (very clever!), but it seems my example data did not capture the structure of my actual data. So, your code worked with the sample but not the data I have. I'll see if I can get it to work; or if I'll need to share some of the data and get more help. — Khashir, Aug 20 '19 at 23:13

How to break out an Excel file with multiple tables in one sheet into separate data frames?

1 Answers1