How to read a table spread across multiple pages using tabula_py or Camelot

Question

I am using tabula_py to read tables from a PDF. Some of them are large, and in many cases a single table spans more than one page. The issue is that tabula_py treats each page's portion as a new table instead of reading it as one large table. I have the same issue with Camelot.

Stefano Fiorucci - anakin87 · Answer 1 · 2020-06-16T13:35:14.720

You're right. Both Camelot and Tabula work page by page.

However, you can write a custom function to detect whether two tables on consecutive pages belong together. If they do, you can merge their content and process them as a single table.

For example, I created this function to process Camelot output:

from numpy import allclose

def are_tables_united(table1_dict, table2_dict):
    # the second table must be on the page right after the first one
    if table2_dict['page'] == (table1_dict['page'] + 1):
        # both tables must have the same number of columns
        if len(table2_dict['cols']) == len(table1_dict['cols']):

            # extract the vertical coordinates of the tables
            # (the checks below assume y grows upward from the page bottom)
            _, y_bottom_table1, _, _ = table1_dict['_bbox']
            _, _, _, y_top_table2 = table2_dict['_bbox']

            # page height in points (792 is US Letter; adjust for other sizes)
            page_height = 792

            # If the first table ends in the last 15% of the page
            # and the second table starts in the first 15% of the page
            if y_bottom_table1 < .15 * page_height and \
                    y_top_table2 > .85 * page_height:

                table1_cols = table1_dict['cols']
                table2_cols = table2_dict['cols']

                table1_cols_width = [col[1] - col[0] for col in table1_cols]
                table2_cols_width = [col[1] - col[0] for col in table2_cols]

                # evaluate if the column widths of the two tables are similar
                return allclose(table1_cols_width, table2_cols_width, atol=3, rtol=0)

    # any failed check means the tables are not united
    return False

The function arguments table1_dict and table2_dict are the __dict__ attributes of the tables returned by Camelot.

For example:

tables=camelot.read_pdf(...)
table1_dict=tables[0].__dict__
table2_dict=tables[1].__dict__
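
Once consecutive tables have been judged united, the merge step can be sketched roughly as follows. This is only an illustration, not part of the function above: merge_table_frames and its united_flags argument are hypothetical names, and the sketch assumes each table is available as a pandas DataFrame (Camelot exposes one through the .df attribute of each table).

```python
import pandas as pd

def merge_table_frames(frames, united_flags):
    """Merge consecutive DataFrames into larger tables.

    frames: list of DataFrames, in page order.
    united_flags: list of booleans, one per consecutive pair;
        united_flags[i] says whether frames[i + 1] continues frames[i].
    (Both names are hypothetical, chosen for this sketch.)
    """
    if not frames:
        return []

    merged = []
    current = [frames[0]]  # pieces of the table being assembled
    for frame, united in zip(frames[1:], united_flags):
        if united:
            # continuation of the current table: keep collecting
            current.append(frame)
        else:
            # a new table starts here: close the current one
            merged.append(pd.concat(current, ignore_index=True))
            current = [frame]
    merged.append(pd.concat(current, ignore_index=True))
    return merged
```

With Camelot output, frames could be built as [t.df for t in tables] and united_flags as [are_tables_united(a.__dict__, b.__dict__) for a, b in zip(tables, tables[1:])]. Note that this sketch does not remove header rows that are repeated on continuation pages; those would need to be dropped separately.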

Hi, how did you extract the table1 and table2 input params using Camelot? How are you getting the number for 'page'? Also, _bbox returns a KeyError. — Sharon, Jun 16 '20 at 01:47
