2

I have an Excel (.xlsx) file with 40 or so sheets. Each sheet has the same structure, but contains different data. I want to extract information from each sheet and consolidate it into one, with the information from each sheet stacked one over the other. The information I need to extract from each are two:

  1. Sheet name, found always in cell E3
  2. Cell area of interest, found always between rows 72-85 and columns E-V

This extracted information would be pasted in rows 2-15 of the consolidated sheet with the sheet name in one column and all other information in the columns next to it. Then, information from the next sheet extracted would be pasted in rows 16-29 and so forth.

I’m new to Python and I am running Python 3.6.1. I have found solutions on how to combine Excel files using Python, but nothing for extracting pieces of information from specified cells in an Excel sheets.

Any guidance would be very helpful.

Update 1: I managed to load into a data frame the area of one of my sheets using.

import pandas as pd
cis = pd.ExcelFile(r"C:\mypath\myfile.xlsx") 
df1 = cis.parse('1. Consumers', skiprows=[0,1,2,3,4,5,6],parse_cols=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]).drop([13])

However, I still have to add the sheet name to the data frame and then run this same loop across all of the sheets to then finally append the information from each sheet together in one dataframe/file.

StatsScared
  • 517
  • 6
  • 20

2 Answers2

3

This should be enough to get you started:

xls = pd.ExcelFile('excel_file_path.xls')
for sheet in xls.sheet_names:
    df = pd.read_excel(file_name, sheetname=sheet)

If you're already pretty familiar with excel, you'll pick up pandas quick. You should absolutely look into using jupyter notebook with python. Jupyter notebook will essentially create a spreadsheet like UI with tons of baked in functionality for running analysis and manipulating data.

References:

Use PANDAS to read multiple sheets from same workbook

Link to jupyter notebook documentation

Pandas tutorials

Practical business solutions with pandas for python

Yale Newman
  • 1,141
  • 1
  • 13
  • 22
  • Thanks for this. Another user mentioned 'openpyxl'. Is there a reason to prefer pandas over openpyxl? I can't figure out how to extract particular cells using pandas. – StatsScared Oct 02 '17 at 14:57
  • See my update. Does your code help me loop through files? – StatsScared Oct 02 '17 at 16:27
  • @StatsScared yes, if all sheets have the same structure, you should be able to handle all sheets with one simple loop. for selecting specific rows columns look into: https://pandas.pydata.org/pandas-docs/stable/indexing.html – Yale Newman Oct 02 '17 at 21:22
  • 1
    please accept answer and if you have more questions for other parts of this process, check what's already been asked, usually its already on here, otherwise post a question specific to your new issue. there is a huge community of pandas users on here constantly answering questions. i rarely have the chance to get to a question before someone else does. – Yale Newman Oct 02 '17 at 21:41
0

This would give you a single dataframe, and a 'Sheet' column that identifies which sheet it comes from:

def pd_read_excel_sheets_to_df(path):
    sheets_dictionary = pd.read_excel(path, sheet_name=None)
    for sheet_name, sheet in sheets_dictionary.items():
        sheet['Sheet'] = sheet_name
    return pd.concat(sheets_dictionary.values(), ignore_index=True)