I am using tabula-py to read tables from a PDF. Some of them are big, and in many cases a table spans more than one page. The issue is that tabula-py treats the part on each page as a new table instead of reading it as one large table. I have the same issue with Camelot.
1 Answer
You're right. Both Camelot and Tabula work page by page.
However, you can write a custom function that checks whether two tables are united, that is, whether the second one continues the first across a page break. If they are, you can merge their content and treat them as one table.
For example, I created this function to process Camelot output:
from numpy import allclose

def are_tables_united(table1_dict, table2_dict):
    """Return True if table2 looks like the continuation of table1 on the next page."""
    # the second table must be on the page right after the first one
    if table2_dict['page'] != table1_dict['page'] + 1:
        return False
    # both tables must have the same number of columns
    if len(table2_dict['cols']) != len(table1_dict['cols']):
        return False
    # extract the vertical coordinates of the tables
    # (the checks below assume y is measured from the bottom of the page)
    _, y_bottom_table1, _, _ = table1_dict['_bbox']
    _, _, _, y_top_table2 = table2_dict['_bbox']
    page_height = 792  # US Letter height in points; adjust for other page sizes
    # the first table must end in the last 15% of the page
    # and the second table must start in the first 15% of the page
    if y_bottom_table1 >= .15 * page_height or y_top_table2 <= .85 * page_height:
        return False
    table1_cols_width = [col[1] - col[0] for col in table1_dict['cols']]
    table2_cols_width = [col[1] - col[0] for col in table2_dict['cols']]
    # evaluate if the column widths of the two tables are similar
    return allclose(table1_cols_width, table2_cols_width, atol=3, rtol=0)
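To make the last check more concrete, here is a small sketch of how `allclose` behaves with `atol=3, rtol=0`. The column widths are made-up values for illustration, not taken from a real PDF: two width lists count as similar only if every pair of widths differs by at most 3 points.

```python
from numpy import allclose

# Hypothetical column widths (in points) of tables on consecutive pages
widths_page1 = [100.0, 50.5, 200.0]
widths_page2 = [101.5, 49.0, 202.9]   # every column within 3 points of page 1
widths_other = [100.0, 50.5, 204.0]   # last column is 4 points wider

# With rtol=0 the comparison is purely absolute: |a - b| <= atol
same_layout = allclose(widths_page1, widths_page2, atol=3, rtol=0)
different_layout = allclose(widths_page1, widths_other, atol=3, rtol=0)

print(same_layout, different_layout)  # True False
```

The 3-point tolerance comes from the function above. A noisier extraction may need a larger value.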
The function arguments table1_dict and table2_dict are the __dict__ attributes of Camelot output tables.
For example:
import camelot

tables = camelot.read_pdf(...)
table1_dict = tables[0].__dict__
table2_dict = tables[1].__dict__
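The merging step is not shown in the answer, so here is a minimal sketch of one way it might look. `merge_united_tables` is a hypothetical helper, not part of Camelot. It assumes each table exposes its content as a pandas DataFrame in `.df` (as Camelot tables do) and that the tables are in document order. It takes the check function as a parameter, so `are_tables_united` can be passed in directly.

```python
import pandas as pd

def merge_united_tables(tables, are_united):
    """Concatenate the DataFrames of consecutive tables that are_united() links.

    tables: table objects in document order, each with a .df DataFrame
    are_united: function taking two table __dict__ objects, returning a bool
    Returns a list of DataFrames, one per logical table.
    """
    merged = []
    group = []        # DataFrames belonging to the table currently being built
    previous = None
    for table in tables:
        if previous is not None and are_united(previous.__dict__, table.__dict__):
            # this table continues the previous one: extend the current group
            group.append(table.df)
        else:
            # a new logical table starts here: flush the finished group, if any
            if group:
                merged.append(pd.concat(group, ignore_index=True))
            group = [table.df]
        previous = table
    if group:
        merged.append(pd.concat(group, ignore_index=True))
    return merged
```

With the function from the answer, the call would be `merge_united_tables(tables, are_tables_united)`. The sketch does not handle header rows repeated on continuation pages; those would need to be dropped separately.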

Stefano Fiorucci - anakin87
- Hi, how did you extract the table1 and table2 input params using Camelot? How are you getting the number for 'page'? Also, _bbox returns a KeyError. – Sharon Jun 16 '20 at 01:47
- You're right. I corrected the code and tried to explain it better. – Stefano Fiorucci - anakin87 Jun 16 '20 at 13:35