Camelot PDF extraction copies text inconsistently across spanned cells

Question

I am extracting data from PDFs using camelot and have run into the following issue on page 3 of this datasheet. The problematic table is shown below:

The issue is that the content of spanned cells is copied inconsistently. As the following picture shows, the spanned cells themselves are detected correctly.

Even though the cells are detected correctly, in the 3rd column the content is copied to only one of the two spanned cells, and in the 4th column it is copied to only two of the three spanned cells. The extracted data is shown below. In each of these two columns, exactly one cell is left without its content.

Here is the code I used, if you want to try it out:

table_areas=['86, 697, 529, 95'] # To ignore page borders
tables = camelot.read_pdf(single_source, pages='all', 
                          flavor = 'lattice', 
                          copy_text=['v'], 
                          line_scale = 110, 
                          table_regions=table_areas, 
                          flag_size = False, 
                          process_background=False)

Code (Colab):

!pip install "camelot-py[cv]" -q
!pip install PyPDF2==2.12.1
!apt-get install ghostscript

import camelot
import pandas as pd
from tabulate import tabulate
import re
import fitz

single_source = '/content/FDB9406_F085-D.PDF'
print("Extracting ", single_source, "...")

table_areas=['86, 697, 529, 95']
tables = camelot.read_pdf(single_source, pages='all', flavor = 'lattice', copy_text=['v'], line_scale = 110, table_regions=table_areas, flag_size = False, process_background=False)


print("Extracting ", single_source, "is finished!")

To visualize the tables:

for table in tables:  # `tables` is the result of camelot.read_pdf above
  print(table.parsing_report, table.shape, table._bbox)
  print(tabulate(table.df, headers='keys', tablefmt='psql'))
  camelot.plot(table, kind='grid').show()

I wasn't able to replicate this issue. Can you share a Google Colab notebook sample with your code? I got errors referring to the deprecation of PdfFileReader... — Marco Aurelio Fernandez Reyes, Jan 13 '23 at 21:03
See [this](https://stackoverflow.com/questions/74939758/camelot-deprecationerror-pdffilereader-is-deprecated) for the error you are getting; it includes a workaround. I have also added the code to the question. @MarcoAurelioFernandezReyes — Said Akyuz, Jan 23 '23 at 07:51
From what I have observed, this happens only when some cells are spanned both vertically and horizontally and the same column also contains cells that are not spanned horizontally. In that case, cells in the same row can end up with opposite values of vspan (True or False). The issue seems to be caused by this attribute, but I still have no solution for it. — Said Akyuz, Jan 25 '23 at 08:29
