
I have the following PDF located here. I have tried repeatedly to read the tables from it, and I have listed everything I have used so far.

I've tried tabula

import tabula

# Read pdf into DataFrame
df = tabula.read_pdf(r"pdf\10027183.pdf")

I've tried textract

import textract
text = textract.process(r"pdf\10027183.pdf", method='pdfminer')

And I've tried tika

from tika import parser

rawText = parser.from_file(r"pdf\10027183.pdf")
rawList = rawText['content'].splitlines()

I've tried pypdf (PyPDF2 got merged back into pypdf)

from pypdf import PdfReader


def get_pdf_content(pdf_file_path):
    reader = PdfReader(pdf_file_path)
    content = "\n".join(page.extract_text().strip() for page in reader.pages)
    content = " ".join(content.split())
    return content


print(get_pdf_content(r"pdf\10027183.pdf"))

And I have tried pdftotext

import pdftotext

with open(r"C:\Users\jfriel\Downloads\10027183.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Iterate over all the pages
for page in pdf:
    print(page)

# Just read the second page (pages are indexed from 0)
print(pdf[1])

# Or read all the text at once
print("\n\n".join(pdf))

Is there a way to read tables from a PDF with Python?

EDIT: This is the result from tabula. It returns only 6 rows, but the PDF has 11:

results of tabula

Martin Thoma
jpf5046

1 Answer


Your document is encrypted. Have a look at the PDF trailer:

trailer
<< /Root 2 0 R
   /Info 1 0 R
   /ID [<BC5D1FCFDAF3326F2552B3182CCF1E18> <BC5D1FCFDAF3326F2552B3182CCF1E18>]
   /Encrypt 36 0 R
   /Size 37
>>
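If you would rather check for this from Python than open the file in an editor, here is a rough sketch using only the standard library. It assumes the /Encrypt entry is an indirect reference written in plain text, as in the trailer above. The helper name `find_encrypt_ref` is made up for illustration, and a raw byte scan like this is a quick diagnostic, not a real PDF parser.

```python
import re

# Matches an indirect reference such as "/Encrypt 36 0 R".
ENCRYPT_REF = re.compile(rb"/Encrypt\s+(\d+)\s+(\d+)\s+R")


def find_encrypt_ref(pdf_bytes):
    """Return (object_number, generation) of the first /Encrypt reference, or None."""
    match = ENCRYPT_REF.search(pdf_bytes)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The trailer quoted above, used as sample input.
sample_trailer = b"""trailer
<< /Root 2 0 R
   /Info 1 0 R
   /Encrypt 36 0 R
   /Size 37
>>"""

print(find_encrypt_ref(sample_trailer))  # (36, 0)
```

For a real file you would pass in the file's bytes, for example `find_encrypt_ref(open("10027183.pdf", "rb").read())`.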

The /Encrypt name refers to object number 36, generation 0. Let's use pdfreader to dive deeper:

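from pdfreader import PDFDocument

# Open in binary mode; the with-block closes the file afterwards
with open("10027183.pdf", "rb") as fd:
    doc = PDFDocument(fd)
    # Fetch the encryption dictionary: object 36, generation 0
    obj = doc.locate_object(36, 0)
    print(obj)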

You will see:

{'Filter': 'Standard', 
 'V': 2, 
 'R': 3, 
 'Length': 128, 
 'P': -3897, 
 'O': '36451BD39D753B7C1D10922C28E6665AA4F3353FB0348B536893E3B1DB5C579B', 
 'U': '7AFCC66F84741480C7129FC777BB1CDE28BF4E5E4E758A4164004E56FFFA0108'}

A value of V=2 stands for the RC4 or AES algorithms, permitting encryption key lengths greater than 40 bits; here Length is 128, so the key is 128 bits long. In your case the password is simply empty, which is why Adobe Reader doesn't ask for one. Nevertheless, all the data is still encrypted.
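As a side note, the P entry in that dictionary is a signed 32-bit field of permission flags. Below is a minimal sketch of decoding it, assuming the bit meanings from the standard security handler's permissions table in the PDF spec (bit positions are 1-indexed there). The names `PERMISSION_BITS` and `decode_permissions` are my own.

```python
# Bit position (1-indexed, as in the PDF spec) -> operation it permits when set.
PERMISSION_BITS = {
    3: "print",
    4: "modify contents",
    5: "copy or extract text and graphics",
    6: "add or modify annotations",
    9: "fill in form fields",
    10: "extract for accessibility",
    11: "assemble document",
    12: "print in high quality",
}


def decode_permissions(p):
    """Map each known permission name to True if its bit is set in /P."""
    unsigned = p & 0xFFFFFFFF  # reinterpret the signed value as 32 unsigned bits
    return {
        name: bool((unsigned >> (bit - 1)) & 1)
        for bit, name in PERMISSION_BITS.items()
    }


for name, allowed in decode_permissions(-3897).items():
    print(f"{name}: {'allowed' if allowed else 'not allowed'}")
```

For -3897 this reports printing as allowed and the other listed operations, including copying or extracting text, as not allowed. These flags are separate from the encryption itself, and I can't say how each extraction library treats them.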

According to the PDF spec, "Encryption applies to all strings and streams ..." with a few exceptions. This means you need to decrypt all streams and strings before extracting data.
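One way to try this, sketched with pypdf since the question already uses it: open the file, call `decrypt("")` with the empty user password, then extract the text. The function name `extract_pages_with_empty_password` is hypothetical. I have not verified that this recovers the tables from this particular file, and it returns plain page text, not table structures. AES-encrypted files may also need an extra cryptography package installed alongside pypdf.

```python
def extract_pages_with_empty_password(source):
    """Return the extracted text of each page, unlocking with "" if encrypted.

    `source` may be a file path or a binary file object.
    """
    # pypdf is a third-party package (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(source)
    if reader.is_encrypted:
        # decrypt() returns a falsy value (0) when the password does not match.
        if not reader.decrypt(""):
            raise ValueError("the empty password did not unlock this PDF")
    return [page.extract_text() for page in reader.pages]


# Example call (path taken from the question):
# for text in extract_pages_with_empty_password(r"pdf\10027183.pdf"):
#     print(text)
```

Raising an error when the empty password fails makes it obvious that the file needs a real password, so you don't end up silently working with undecrypted data.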

Maksym Polshcha
  • would you recommend decrypting with `pdfReader.decrypt(' ')`? I found that solution here: https://stackoverflow.com/questions/49822853/how-to-read-this-pdf-form-using-pypdf2-in-python – jpf5046 Jan 06 '20 at 17:01
  • Thank you for the suggestion, I'm still stuck, will take the same idea and try with a different module – jpf5046 Jan 09 '20 at 14:04