"Efficient" can have different interpretations. Based on your description (especially the mention of generators) I'm guessing you mean memory and computational efficiency (use as little memory as possible and avoid repeated loops over the same data). With that thought, here's one go:
import pandas as pd
import xlrd


def df_gen(filename, sheet_names):
    with xlrd.open_workbook(filename, on_demand=True) as xl_file:
        for sheet in sheet_names:
            yield pd.read_excel(
                xl_file, sheetname=sheet, engine='xlrd').assign(source=sheet)
            # tell xlrd to let the sheet leave memory
            xl_file.unload_sheet(sheet)
This makes use of xlrd's "worksheets on demand" feature to avoid loading the entire Excel document into memory, and each sheet is explicitly unloaded once its DataFrame has been constructed. Because this uses yield it's a generator, so how many DataFrames exist in memory at once depends on how you consume it. Here's an example that passes the generator to pandas.concat:
df = pd.concat(df_gen('file_name.xlsx', ['sheet1', 'sheet2']), ignore_index=True)
Note, though, that concat materializes everything in the generator before doing the concatenation, so this doesn't necessarily end up being any more efficient than your example of building up a list, except that my function deliberately manages the resource usage of the xlrd workbook. In this case I think you end up with 1 or 2 copies of the data in memory at once, depending on the internals of concat.
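In other words, passing the generator to concat is roughly equivalent to the list version sketched below (same hypothetical file and sheet names as above); the main difference is that df_gen keeps xlrd from holding every sheet's raw data in memory while the DataFrames are being built:
# Roughly what pd.concat does with the generator internally: exhaust it
# into a list of DataFrames, then concatenate them all at once.
frames = list(df_gen('file_name.xlsx', ['sheet1', 'sheet2']))
df = pd.concat(frames, ignore_index=True)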
If you are really worried about memory, you could use the generator to build a dataframe iteratively, one sheet at a time:
# create a generator
gen = df_gen(str(filename), sheet_names)
# get starting point
df = next(gen)
# iterate over the rest of the generator
for next_df in gen:
    df = df.append(next_df, ignore_index=True)
I'd expect this to be less computationally efficient than calling concat with the entire set of desired dataframes at once, though I haven't researched whether that's really true. In this case I think you end up with only 1 copy of all the data in memory at once, plus one extra copy of a sheet's data for each loop through the generator.
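If you want to know for sure how the two approaches compare on your data, a rough timing sketch like the one below (file and sheet names are placeholders) would settle it:
import time

# Hypothetical comparison: concatenate everything at once...
start = time.perf_counter()
df_once = pd.concat(df_gen('file_name.xlsx', ['sheet1', 'sheet2']),
                    ignore_index=True)
print('single concat: {:.2f} s'.format(time.perf_counter() - start))

# ...versus appending one sheet at a time.
start = time.perf_counter()
gen = df_gen('file_name.xlsx', ['sheet1', 'sheet2'])
df_iter = next(gen)
for next_df in gen:
    df_iter = df_iter.append(next_df, ignore_index=True)
print('repeated append: {:.2f} s'.format(time.perf_counter() - start))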
You know your situation best, but unless these are some truly impressive Excel files I wouldn't put a ton of effort into optimizing memory and computation beyond what seem like clear wins. With that in mind, here's a short function that leverages the ability of pandas.read_excel to read multiple sheets at once:
def sheets_to_df(filename, sheet_names):
    df_dict = pd.read_excel(filename, sheetname=sheet_names)
    return pd.concat(
        (df.assign(source=sheet) for sheet, df in df_dict.items()),
        ignore_index=True)
One thing to note is that when passing in a file name, read_excel will load the entire Excel document (i.e. it does not make use of xlrd's "on demand" feature). So while this is efficient in terms of lines of code, it's definitely not efficient in terms of memory. I think this briefly ends up with all the data in memory 2-3 times: once in df_dict and once in the final concatenated dataframe (and possibly again depending on the internals of concat). But once this function returns you're left with only the one copy in the final dataframe. If you were planning to read most of the sheets anyway this wouldn't be a huge waste (assuming they all fit in memory at least twice), but if you were planning to read only a small subset of the sheets it could be a bit wasteful.
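If you're not sure whether holding the data in memory twice is actually a problem for your files, a quick sanity check (a sketch using hypothetical file and sheet names) is to ask pandas how much memory the combined result occupies and compare that to what your machine has to spare:
# Hypothetical usage of the function above.
df = sheets_to_df('file_name.xlsx', ['sheet1', 'sheet2'])

# deep=True makes pandas account for the real size of object/string columns.
n_bytes = df.memory_usage(deep=True).sum()
print('final dataframe uses roughly {:.1f} MB'.format(n_bytes / 1e6))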
I hope this helps! You can get this as a Jupyter notebook here: https://gist.github.com/jiffyclub/9ab668f63c3d0f9adf3e730dc37cd419