
I am using tabula-py to read tables from a PDF. Some of the tables are large, and in many cases a single table spans more than one page. The issue is that tabula-py treats each page's portion as a new table instead of reading it as one large table. I have the same issue with Camelot.

Sharon

1 Answer


You're right. Both Camelot and Tabula work page by page.

However, you can write a custom function that checks whether two tables are really one table split across consecutive pages. If they are, you can merge their content and treat them as a single table.

For example, I created this function to process Camelot output:

from numpy import allclose

def are_tables_united(table1_dict, table2_dict):
    # The second table must sit on the page right after the first one
    if table2_dict['page'] != table1_dict['page'] + 1:
        return False
    # Both tables must have the same number of columns
    if len(table2_dict['cols']) != len(table1_dict['cols']):
        return False

    # Extract the vertical coordinates of the tables.
    # _bbox is unpacked as (x_left, y_bottom, x_right, y_top),
    # with y measured from the bottom of the page
    _, y_bottom_table1, _, _ = table1_dict['_bbox']
    _, _, _, y_top_table2 = table2_dict['_bbox']

    # 792 pt is the height of a US Letter page; adjust for other page sizes
    page_height = 792

    # The first table must end in the last 15% of its page
    # and the second table must start in the first 15% of its page
    if y_bottom_table1 >= .15 * page_height or y_top_table2 <= .85 * page_height:
        return False

    table1_cols_width = [col[1] - col[0] for col in table1_dict['cols']]
    table2_cols_width = [col[1] - col[0] for col in table2_dict['cols']]

    # Evaluate whether the column widths of the two tables are similar
    # (absolute tolerance of 3, no relative tolerance)
    return allclose(table1_cols_width, table2_cols_width, atol=3, rtol=0)

The function arguments table1_dict and table2_dict are the __dict__ attributes of the tables returned by Camelot.

For example:

import camelot

tables=camelot.read_pdf(...)
table1_dict=tables[0].__dict__
table2_dict=tables[1].__dict__
print(are_tables_united(table1_dict,table2_dict))
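
To apply the check to a whole document rather than just two tables, one option is to walk through the tables in order and collect consecutive "united" tables into groups. The following is only a minimal sketch: group_united_tables is a hypothetical helper name (it is not part of Camelot), and it assumes the tables come back in page order and that the predicate has the same signature as are_tables_united above.

```python
def group_united_tables(tables, are_united):
    """Group consecutive tables that `are_united` says belong together.

    `tables` is any sequence of objects exposing `__dict__` (for example
    the tables returned by Camelot); `are_united` is a predicate that takes
    two such dicts, like are_tables_united. Both names are illustrative.
    """
    groups = []
    for table in tables:
        # Compare the new table with the last table of the current group
        if groups and are_united(groups[-1][-1].__dict__, table.__dict__):
            groups[-1].append(table)
        else:
            # Not a continuation: start a new group
            groups.append([table])
    return groups
```

Each group can then be merged into one DataFrame, for example with pandas.concat([t.df for t in group], ignore_index=True). If the header row is repeated on every page, it probably needs to be dropped from the continuation tables before concatenating.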