I have the following pdf located here. I have tried repeatedly to read the tables from the pdf, and I have listed everything I have used so far.
I've tried tabula
import tabula
# Read pdf into DataFrame
df = tabula.read_pdf(r"pdf\10027183.pdf")
I've tried textract
import textract
text = textract.process(r"pdf\10027183.pdf", method='pdfminer')
And I've tried tika
from tika import parser
rawText = parser.from_file(r"pdf\10027183.pdf")
rawList = rawText['content'].splitlines()
I've tried pypdf
(PyPDF2 got merged back into pypdf)
from pypdf import PdfReader
def get_pdf_content(pdf_file_path):
    reader = PdfReader(pdf_file_path)
    content = "\n".join(page.extract_text().strip() for page in reader.pages)
    content = " ".join(content.split())
    return content
print(get_pdf_content(r"pdf\10027183.pdf"))
And I have tried pdftotext
import pdftotext
with open(r"C:\Users\jfriel\Downloads\10027183.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)
# Iterate over all the pages
for page in pdf:
    print(page)
# Just read the second page (pages are zero-indexed)
print(pdf[1])
# Or read all the text at once
print("\n\n".join(pdf))
Is there a way to read tables from a pdf in Python?
EDIT: This is the result from tabula, which only returns 6 rows, whereas the pdf has 11: