I have thousands of PDF files, composed only by tables, with this structure:
However, despite being fairly structured, I cannot read the tables without losing the structure.
I tried PyPDF2, but the data comes completely messed up.
import PyPDF2
pdfFileObj = open(pdf_file.pdf, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
print(pageObj.extractText().split('\n')[0])
print(pageObj.extractText().split('/')[0])
I also tried Tabula, but it only reads the header (and not the content of the tables)
from tabula import read_pdf
pdfFile1 = read_pdf(pdf_file.pdf, output_format = 'json') #Option 1: reads all the headers
pdfFile2 = read_pdf(pdf_file.pdf, multiple_tables = True) #Option 2: reads only the first header and few lines of content
Any thoughts?