
I have several Excel files from which I would like to generate individual dataframes, where each dataframe is a concatenation of an arbitrary subset of the sheets within its file.

It's important that I be able to subset the sheets (here I do that by indexing into the list of sheet_names) and also track provenance in some manner (as I do here with the assign method).

This code currently works, but I'm concerned that I'm missing some fundamental way to make it more efficient (for when I wind up with a file with, say, 20 sheets).

I've already reviewed several other questions¹ pertaining to concatenating, say, multiple CSVs or Excel files, but I have a hard time generalizing those to the problem of efficiently parsing sheets within a single file.

This question is partially driven by my having used generators to concatenate Excel files before, but I'm having a hard time generalizing those cases to this one.

Example Excel File on GitHub

import pandas as pd
import xlrd
import xlwt

def file_parser(file):
    # `file` is a pd.ExcelFile; skip the first sheet and tag each
    # sheet's dataframe with the name of the sheet it came from
    df_list = []
    for x in file.sheet_names[1:]:
        df = file.parse(x).assign(Source=x)
        df_list.append(df)
    return df_list
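
For reference, here is roughly how I then use the function; the file name below is only a placeholder for one of my actual workbooks:

# rough usage sketch -- 'example.xlsx' stands in for a real file
xl_file = pd.ExcelFile('example.xlsx')
df = pd.concat(file_parser(xl_file), ignore_index=True)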

¹ How to parse dataframes from an excel sheet with many tables (using Python, possibly Pandas)

   Import multiple csv files into pandas and concatenate into one DataFrame

   Pandas ExcelFile.parse() reading file in as dict instead of dataframe

  • are the sheet names the same that you need? or do you never want the first sheet? – DJK Jul 15 '17 at 00:47
  • @djk47463 for the first few files, I've found it's enough to omit the first sheet, but this may not remain the case – iff_or Jul 15 '17 at 19:07
  • I see from your comments that you think it should be marked for closure. I thought that the person could, with edits, have a viable question, and I understand why you do not. I'll consider that when I next vote, thanks. – iff_or Oct 15 '18 at 19:30

1 Answer


"Efficient" can have different interpretations. Based on your description (especially the mention of generators) I'm guessing you mean memory and computational efficiency (use as little memory as possible and avoid repeated loops over the same data). With that thought, here's one go:

def df_gen(filename, sheet_names):
    with xlrd.open_workbook(filename, on_demand=True) as xl_file:
        for sheet in sheet_names:
            yield pd.read_excel(
                xl_file, sheetname=sheet, engine='xlrd').assign(source=sheet)
            # tell xlrd to let the sheet leave memory
            xl_file.unload_sheet(sheet)

This makes use of xlrd's "worksheets on demand" feature to avoid loading the entire Excel document into memory, and each sheet is explicitly unloaded after its DataFrame is constructed. Because this uses yield it's a generator, so how many dataframes exist in memory at once depends on how you consume it. Here's an example usage passing this generator to pandas.concat:

df = pd.concat(df_gen('file_name.xlsx', ['sheet1', 'sheet2']), ignore_index=True)

Note, though, that concat materializes everything in the generator before doing the concatenation, so this isn't necessarily more efficient than your example of building up a list, except that my function deliberately manages the resource usage of the xlrd workbook. In this case I think you end up with one or two copies of the data in memory at once, depending on the internals of concat.

If you are really worried about memory you could use the generator to iteratively build a dataframe one sheet at a time:

# create a generator
gen = df_gen(str(filename), sheet_names)

# get starting point
df = next(gen)

# iterate over the rest of the generator
for next_df in gen:
    df = df.append(next_df, ignore_index=True)

I'd expect this to be less computationally efficient than calling concat with the entire set of desired dataframes at once, though I haven't researched whether that's really true. In this case I think you end up with only 1 copy of all the data in memory at once, plus one extra copy of a sheet's data for each loop through the generator.

You know your situation best, but unless these are some truly impressive Excel files I wouldn't put a ton of effort into optimizing memory and computation beyond what seem like clear wins. With that in mind, here's a short function that leverages the ability of pandas.read_excel to read multiple sheets at once:

def sheets_to_df(filename, sheet_names):
    # read_excel returns a dict of {sheet name: dataframe} when given a list of sheets
    df_dict = pd.read_excel(filename, sheetname=sheet_names)
    return pd.concat(
        (df.assign(source=sheet) for sheet, df in df_dict.items()),
        ignore_index=True)

One thing to note is that when passing in a file name, read_excel will load the entire Excel document (i.e. it does not make use of xlrd's "on demand" feature). So while this is efficient in terms of lines of code, it's definitely not efficient in terms of memory. I think this briefly ends up with all the data in memory two to three times: once in df_dict, once in the final concatenated dataframe, and possibly again depending on the internals of concat. But once this function returns you're left with only the one copy in the final dataframe. If you were planning to read most of the sheets anyway this wouldn't be a huge waste (assuming they all fit in memory at least twice), but if you were planning to read only a small subset of the sheets it could be a bit wasteful.
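
As a rough usage sketch (the file and sheet names are placeholders, mirroring the concat example above):

df = sheets_to_df('file_name.xlsx', ['sheet1', 'sheet2'])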

I hope this helps! You can get this as a Jupyter notebook here: https://gist.github.com/jiffyclub/9ab668f63c3d0f9adf3e730dc37cd419

jiffyclub