Is it possible to read multiple tables from a sheet excel file using pandas ? Something like: read table1 from row0 until row100 read table2 from row 102 until row202 ...
Asked
Active
Viewed 2.7k times
20
-
1Why not just read it all in and then separate to different `DataFrame`s in python? – splinter Apr 12 '17 at 11:12
-
I don't know exactly how can I achieve that. – bsd Apr 12 '17 at 11:20
-
@bsd, do you know the total # of rows beforehand? – MaxU - stand with Ukraine Apr 12 '17 at 11:31
3 Answers
19
I wrote the following code to identify the multiple tables automatically, in case you have many files you need to process and don't want to look in each one to get the right row numbers. The code also looks for non-empty rows above each table and reads those as table metadata.
def parse_excel_sheet(file, sheet_name=0, threshold=5):
'''parses multiple tables from an excel sheet into multiple data frame objects. Returns [dfs, df_mds], where dfs is a list of data frames and df_mds their potential associated metadata'''
xl = pd.ExcelFile(file)
entire_sheet = xl.parse(sheet_name=sheet_name)
# count the number of non-Nan cells in each row and then the change in that number between adjacent rows
n_values = np.logical_not(entire_sheet.isnull()).sum(axis=1)
n_values_deltas = n_values[1:] - n_values[:-1].values
# define the beginnings and ends of tables using delta in n_values
table_beginnings = n_values_deltas > threshold
table_beginnings = table_beginnings[table_beginnings].index
table_endings = n_values_deltas < -threshold
table_endings = table_endings[table_endings].index
if len(table_beginnings) < len(table_endings) or len(table_beginnings) > len(table_endings)+1:
raise BaseException('Could not detect equal number of beginnings and ends')
# look for metadata before the beginnings of tables
md_beginnings = []
for start in table_beginnings:
md_start = n_values.iloc[:start][n_values==0].index[-1] + 1
md_beginnings.append(md_start)
# make data frames
dfs = []
df_mds = []
for ind in range(len(table_beginnings)):
start = table_beginnings[ind]+1
if ind < len(table_endings):
stop = table_endings[ind]
else:
stop = entire_sheet.shape[0]
df = xl.parse(sheet_name=sheet_name, skiprows=start, nrows=stop-start)
dfs.append(df)
md = xl.parse(sheet_name=sheet_name, skiprows=md_beginnings[ind], nrows=start-md_beginnings[ind]-1).dropna(axis=1)
df_mds.append(md)
return dfs, df_mds

Rotem
- 191
- 1
- 2
-
-
1
-
1Throws the following exception for me : ValueError: 'nrows' must be an integer >=0 – Shihab Ullah Aug 05 '22 at 07:43
-
-
1@NoobVB The threshold variable is a guess on what determines the difference between a table beginning and ending. It's a finicky heuristic in this sense. the variables n_values and n_values_deltas give some hints as they are a sum of all the elements in a row that ARE NOT NULL, followed by the difference between the previous row and the current row. It is in essence a form of "edge detection" where you can imagine a step function of index on the x axis and the deltas on the y axis the threshold is a constant line on the positive x and - x axis. When crossed a new table is defined. – ZdWhite Nov 02 '22 at 20:44
15
Assuming we have the following Excel file:
Solution: we are parsing the first sheet (index: 0
)
xl = pd.ExcelFile(fn)
nrows = xl.book.sheet_by_index(0).nrows
df1 = xl.parse(0, skipfooter= nrows-(10+1)).dropna(axis=1, how='all')
df2 = xl.parse(0, skiprows=12).dropna(axis=1, how='all')
EDIT: skip_footer
was replaced with skipfooter
Result:
In [123]: df1
Out[123]:
a b c
0 78 68 33
1 62 26 30
2 99 35 13
3 73 97 4
4 85 7 53
5 80 20 95
6 40 52 96
7 36 23 76
8 96 73 37
9 39 35 24
In [124]: df2
Out[124]:
c1 c2 c3 c4
0 78 88 59 a
1 82 4 64 a
2 35 9 78 b
3 0 11 23 b
4 61 53 29 b
5 51 36 72 c
6 59 36 45 c
7 7 64 8 c
8 1 83 46 d
9 30 47 84 d

123
- 595
- 6
- 18

MaxU - stand with Ukraine
- 205,989
- 36
- 386
- 419
4
First read in the entire csv
file:
import pandas as pd
df = pd.read_csv('path_to\\your_data.csv')
and then obtain the individual frames, for example using:
df1 = df.iloc[:100,:]
df2 = df.iloc[100:200,:]

splinter
- 3,727
- 8
- 37
- 82
-
1if it would be a CSV file we could simply use `skiprows` and `nrows` parameters. Unfortunately the `nrows` is not implemented for `pd.read_excel` – MaxU - stand with Ukraine Apr 12 '17 at 11:59