5

I have hundreds of Excel files with hundreds of tables.

I want to use python to automatically detect these tables and convert them to pandas dataframes.

Some key facts about these tables:

  • These tables are not named.
  • The positioning of these tables are also not known
  • One sheet could have multiple tables (usually separated by a few blank lines )
  • there could be annotations
  • a table will always have a cluster of cells with values and rows and column labels
  • the column labels might be on 1 row or 2 rows (think of nested headers in pandas df)

Here is an example of what the table look like.

enter image description here

enter image description here

Here is my proposed solution (I would like your input on this):

  1. start from the bottom right hand corner of a given sheet in excel (so working from bottom up)
  2. search for a cluster of cells with values in it. Say you find 4 cells (2 above and 2 below) with numbers in it. This would indicate that you have landed on a table
  3. you would then figure out the dimensions of this table, so the top left and bottom right, this would be the values of the df
  4. you would then figure out how many levels there are in the columns/header and extract the labels
  5. then you would figure out how many levels there are in the rows/index and extract the labels
  6. then move to the next table and repeat

What library to use?

  • xlrd: very fast but does not read cell formatting for .xlsx files. It does provide cell format information for .xls files but not .xlsx. We are focused on .xlsx.
  • openpyxl: very slow, uses tons of memory because it loads everything into memory. It does provide cell formatting information.

During my research I across these two questions which are similar but the solutions proposed are VBA based, whereas I need a purely python solution.

  1. How to automatically detect tables in Excel
  2. Using machine learning to parse excel file and extract table data with no named tables involved? If so, how do I get started?
Boosted_d16
  • 13,340
  • 35
  • 98
  • 158
  • is running vba from python an option or are you looking for purely python? – Finn Jan 28 '20 at 11:10
  • purely python, the code will be used in a flask app. – Boosted_d16 Jan 28 '20 at 11:30
  • This [question & answer](https://stackoverflow.com/q/62509367/7758804) show how to use the python `win32com` module to create an object, which gives access to all of the Excel methods. The answer has links to a number of resources. You can use record macro in Excel to determine the VBA methods to use with python to implement a script for finding used cells. – Trenton McKinney Jan 12 '21 at 05:17
  • The link you've shared is not related to what I'm trying to do. Using win32com is not an option. As I mentioned, the code is being used in a Flask webapp. – Boosted_d16 Jan 20 '21 at 00:40
  • I’m one of the authors of the questions you linked to. Did you manage to find a python solution for this? – Kim Stacks Jun 16 '21 at 18:05

0 Answers0