
I am using tabula-py to extract tables from a PDF, with lattice mode for parsing. It works well for every row except the first one.

code:

from tabula import read_pdf
df = read_pdf("filename.pdf", pages=21, multiple_tables=True, lattice=True)  # returns a list of DataFrames

Table in the PDF: [image]

Output from tabula: [image]

The PDF contains multiple tables with varying areas and numbers of columns. As you can see in the images, lattice mode parses the 2nd and 3rd rows correctly, but the 1st row is not parsed correctly.
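One thing that may be worth checking, sketched below: whether the first row is being consumed as the column header rather than being misparsed. This is only an assumption, since the screenshots are not available here. tabula-py forwards `pandas_options` to pandas, so `{"header": None}` keeps the first row as data. The helper names `lattice_options` and `extract_tables` are hypothetical, introduced only for illustration.

```python
def lattice_options(page, keep_first_row_as_data=True):
    """Build keyword arguments for tabula.read_pdf (hypothetical helper)."""
    options = {"pages": page, "multiple_tables": True, "lattice": True}
    if keep_first_row_as_data:
        # Forwarded to pandas: header=None stops the first row
        # from being promoted to column names.
        options["pandas_options"] = {"header": None}
    return options


def extract_tables(pdf_path, page):
    """Return the tables tabula finds on one page as a list of DataFrames."""
    # Imported lazily so lattice_options() works without tabula-py installed.
    from tabula import read_pdf

    return read_pdf(pdf_path, **lattice_options(page))


# Usage (needs tabula-py, Java, and the PDF from the question):
# tables = extract_tables("filename.pdf", 21)
# for table in tables:
#     print(table.head())
```

If the first row is still wrong with the header disabled, the cause is more likely the ruling lines or merged cells in that row than the header handling.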

I also tried the camelot library, but it raises a PyPDF2 deprecation error.

  • I have no experience with tabula, but for your case I can suggest "camelot", which is more reliable. Its "lattice" parsing method is a better fit for tables with borders like the one in your question. – Said Akyuz Jan 10 '23 at 13:43
  • For the deprecation error, there is a workaround. Check out: https://stackoverflow.com/questions/74939758/camelot-deprecationerror-pdffilereader-is-deprecated – Said Akyuz Jan 10 '23 at 13:45
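
The camelot suggestion in the comments could look roughly like the sketch below. It assumes the deprecation error has been dealt with first: `PdfFileReader` was removed in PyPDF2 3.0, so installing an older PyPDF2 (for example `pip install "PyPDF2<3.0"`) is one commonly suggested workaround. The helper names `camelot_options` and `extract_with_camelot` are hypothetical.

```python
def camelot_options(page):
    """Build keyword arguments for camelot.read_pdf (hypothetical helper)."""
    # Camelot takes pages as a string such as "21" or "1,3-5";
    # flavor="lattice" uses the ruling lines of bordered tables.
    return {"pages": str(page), "flavor": "lattice"}


def extract_with_camelot(pdf_path, page):
    """Return each table camelot detects on the page as a DataFrame."""
    # Imported lazily so camelot_options() works without camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, **camelot_options(page))
    return [table.df for table in tables]


# Usage (needs camelot and the PDF from the question):
# frames = extract_with_camelot("filename.pdf", 21)
# print(frames[0])
```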
