5

I would like to extract multiple tables of a series of excel spreadsheets where some sheets may contain more than one table, to store the tables separately as e.g. csv files. The table could be something like this:

enter image description here

if I read it using pandas read_excel

import pandas as pd
pd.read_excel('table_example.xlsx',header=None)

I would get something like this:

enter image description here

how could I extract the different tables? In my case tables have NaN values, which is potentially an additional complication.

[EDIT1] something similar to the excel tables can be generated using pandas:

df=pd.DataFrame(np.nan,index=range(0,10),columns=range(0,10))
df.iloc[1,2:5]=['t1h1','t1h2','t1h3']
df.iloc[2:5,2:5]=np.random.randn(3,3)
df.iloc[6,3:7]=['t2h1','t2h2','t2h3','t2h4']
df.iloc[7:9,3:7]=np.random.randn(2,4)

I have tried to find the limits of the tables using built-in pandas functions:

df[df.isnull().all(axis=1)]

I could use the first and second row to set an horizontal division and maybe make a first split but I don't know how to select the cells above or below the identified rows. Or even if this is the most convenient approach.

disclaimer: in my case the tables always have a tag in the row above the header, this is because these tables are read by a non-python software which uses them to identify where the table starts. I decided to not consider these tags to ask a more generic problem, that other people may encounter.

Nabla
  • 1,509
  • 3
  • 20
  • 35
  • The pandas parsing code is already quite complex in an effort to do the *right thing* in most cases. This seems unlikely to ever be one of the *right* cases. So I would suggest that it would seem that the markers are there for a good reason. – Stephen Rauch Apr 02 '17 at 21:44
  • I'm lazy but if you provide a dump of your example I can load instead of images I may show you an possible approach – Zeugma Apr 03 '17 at 13:48
  • 1
    Assuming your tables are nicely structured, you can try to identify entire rows of NaN, and use those indexes to slice your df (This assumes that each table has at least blank row before and after the table). Then for each slice, just drop all the NaN columns. – Yeile Apr 03 '17 at 14:36

2 Answers2

5

This may help to dynamically locate and extract the table as long as the 2 tables are separated by either a row or column of NaNs.

I used the boundingbox solution from https://stackoverflow.com/a/54675526

from skimage.measure import label, regionprops
#this basically converts your table into 0s and 1s where 0 is NaN and 1 for non NaN 
binary_rep = np.array(df.notnull().astype('int'))

list_of_dataframes = []
l = label(binary_rep)
for s in regionprops(l):
    #the bbox contains the extremes of the bounding box. So the top left and bottom right cell locations of the table.
    list_of_dataframes.append(df.iloc[s.bbox[0]:s.bbox[2],s.bbox[1]:s.bbox[3]])

Roshan
  • 51
  • 1
  • 1
  • 1
    The code extracts single line and paragraph as tables. To avoid this, we filter the table/data frame which contains at least two columns. But tables with single column and multiple row will be discarded. In most cases, table contains at least two columns, so it is ok to apply the filter in loop to avoid single line and paragraph. Here is the updated code in for loop **if df.iloc[s.bbox[0]:s.bbox[2],s.bbox[1]:s.bbox[3]].shape[1]>=2: list_of_dataframes.append(df.iloc[s.bbox[0]:s.bbox[2],s.bbox[1]:s.bbox[3]])** – Govinda Oct 06 '21 at 12:59
4
import numpy as np
import pandas as pd

# I have assumed that the tables are "separated" by at least one row with only NaN values

df=pd.DataFrame(np.nan,index=range(0,10),columns=range(0,10))
df.iloc[1,2:5]=['t1h1','t1h2','t1h3']
df.iloc[2:5,2:5]=np.random.randn(3,3)
df.iloc[6,3:7]=['t2h1','t2h2','t2h3','t2h4']
df.iloc[7:9,3:7]=np.random.randn(2,4)

print(df)

# Extract by rows

nul_rows = list(df[df.isnull().all(axis=1)].index)

list_of_dataframes = []
for i in range(len(nul_rows) - 1):
    list_of_dataframes.append(df.iloc[nul_rows[i]+1:nul_rows[i+1],:])


# Remove null columns

cleaned_tables = []
for _df in list_of_dataframes:
    cleaned_tables.append(_df.dropna(axis=1, how='all'))

# cleaned_tables is a list of the dataframes

print(cleaned_tables[0])
print(cleaned_tables[1])
gxpr
  • 836
  • 9
  • 12