Python Data Extraction from an Encrypted PDF

Question

I am a recent graduate in pure mathematics who has taken only a few basic programming courses. I am doing an internship and have an internal data analysis project: I have to analyze the internal PDFs of the last few years. The PDFs are "secured"; in other words, they are encrypted. We do not have the PDF passwords, and we are not even sure that passwords exist. However, we have all these documents and can read them manually. We can print them as well. The goal is to read them with Python because it is the language we have some familiarity with.

First, I tried to read the PDFs with some Python libraries. However, the Python libraries that I found do not read encrypted PDFs. At that time, I could not export the information using Adobe Reader either.

Second, I decided to decrypt the PDFs. I was successful using the Python library pikepdf. Pikepdf works very well! However, the decrypted PDFs cannot be read either with the Python libraries from the previous point (PyPDF2 and Tabula). We have made some progress, because with Adobe Reader I can now export the information from the decrypted PDFs, but the goal is to do everything with Python.

The code that I am showing works perfectly with unencrypted PDFs, but not with encrypted PDFs. It also does not work with the decrypted PDFs produced by pikepdf.

I did not write the code. I found it in the documentation of the Python libraries pikepdf and Tabula. The PyPDF2 solution was written by Al Sweigart in his book "Automate the Boring Stuff with Python," which I highly recommend. I also checked that the code works fine, with the limitations that I explained before.

First question: why can I not read the decrypted files, if the programs work with files that have never been encrypted?

Second question: can we read the decrypted files with Python somehow? Which library can do it, or is it impossible? Are all decrypted PDFs extractable?

Thank you for your time and help!!!

I found these results using Python 3.7, Windows 10, Jupyter Notebooks, and Anaconda 2019.07.

Python

import pikepdf
with pikepdf.open("encrypted.pdf") as pdf:
  num_pages = len(pdf.pages)
  del pdf.pages[-1]
  pdf.save("decrypted.pdf")

import tabula
tabula.read_pdf("decrypted.pdf", stream=True)

import PyPDF2
pdfFileObj=open("decrypted.pdf", "rb")
pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
pdfReader.numPages
pageObj=pdfReader.getPage(0)
pageObj.extractText()

With Tabula, I am getting the message "the output file is empty."

With PyPDF2, I am getting only '\n'

UPDATE 10/3/2019 Pdfminer.six (Version November 2018)

I got better results using the solution posted by DuckPuncher. For the decrypted file, I got the labels but not the data. The same happens with the encrypted file. For the file that has never been encrypted, it works perfectly. As I need both the data and the labels from encrypted or decrypted files, this code does not work for me. For that analysis, I used the version of pdfminer.six that was released in November 2018. Pdfminer.six includes the library pycryptodome. According to its documentation, "PyCryptodome is a self-contained Python package of low-level cryptographic primitives."

The code is in the stack exchange question: Extracting text from a PDF file using PDFMiner in python?

I would love for you to repeat my experiment. Here is the description:

1) Run the code mentioned in this question with any PDF that has never been encrypted.

2) Do the same with a "Secure" PDF (this is a term that Adobe uses); I am calling it the encrypted PDF. Use a generic form that you can find using Google. After you download it, you need to fill in the fields. Otherwise, you would be checking for labels, but not fields. The data is in the fields.

3) Decrypt the encrypted PDF using pikepdf. This will be the decrypted PDF.

4) Run the code again using the decrypted PDF.
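
One possible reason for the "labels but not data" result, offered here as an assumption and not something established in this question: in fillable forms, the entered values are usually stored in the document's form-field dictionaries (`/T` for the partial name, `/V` for the value, `/Kids` for child fields) and not in the page content that text extractors read. The sketch below walks such a field tree. It uses plain Python dictionaries as stand-ins for the objects a PDF library would return, and `collect_field_values` and `sample_fields` are hypothetical names.

```python
def collect_field_values(fields, prefix=""):
    """Flatten a tree of form-field dictionaries into {full_name: value}."""
    values = {}
    for field in fields:
        partial = field.get("/T")
        # Fully qualified names join the partial names with periods.
        name = ".".join(part for part in (prefix, partial) if part)
        if "/V" in field:
            values[name] = field["/V"]
        # Recurse into child fields, if any.
        values.update(collect_field_values(field.get("/Kids", []), name))
    return values


# Stand-in data shaped like a small form (not read from a real PDF).
sample_fields = [
    {"/T": "applicant", "/Kids": [
        {"/T": "first_name", "/V": "Ada"},
        {"/T": "last_name", "/V": "Lovelace"},
    ]},
    {"/T": "amount", "/V": "1200"},
    {"/T": "signature"},  # no /V: the field was left empty
]

print(collect_field_values(sample_fields))
```

If the values do live in form fields, a library that exposes the field tree would be needed to read them from the real file; which library to use is left open here.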

UPDATE 10/4/2019 Camelot (Version July 2019)

I found the Python library Camelot. Be careful that you need camelot-py 0.7.3.

It is very powerful and works with Python 3.7. It is also very easy to use. You need to install Ghostscript first; otherwise, it will not work. You also need to install Pandas. Do not use pip install camelot-py. Instead, use pip install camelot-py[cv]

The author of the program is Vinayak Mehta. Frank Du shares this code in a YouTube video "Extract tabular data from PDF with Camelot Using Python."

I checked the code and it is working with unencrypted files. However, it does not work with encrypted and decrypted files, and that is my goal.

Camelot is oriented to get tables from PDFs.

Here is the code:

Python

import camelot
import pandas

# read_pdf returns a TableList (a list-like collection of Table objects)
name_table = camelot.read_pdf("unencrypted.pdf")
type(name_table)

# Each element is a Camelot Table object
name_table[0]

first_table = name_table[0]

# The .df attribute holds the table as a pandas DataFrame
first_table.df

# This creates an Excel file.
# The same can be done with csv, json, html, or sqlite.
first_table.to_excel("unencrypted.xlsx")

# To get all the tables of the PDF, loop over the TableList.
for table in name_table:
    print(table.df)

UPDATE 10/7/2019 I found one trick. If I open the secured PDF with Adobe Reader, print it using Microsoft Print to PDF, and save the result as a PDF, I can extract the data from that copy. I can also convert that PDF file to JSON, Excel, SQLite, CSV, HTML, and other formats. This is a possible solution to my question. However, I am still looking for a way to do it without that trick, because the goal is to do it 100% with Python. I am also concerned that the trick might not work if a better encryption method is used. Sometimes you need to use Adobe Reader several times to get an extractable copy.

UPDATE 10/8/2019. I now have a third question. Are all secured/encrypted PDFs password protected? Why is pikepdf not working? My guess is that the current version of pikepdf can break some types of encryption but not all of them. @constt mentioned that PyPDF2 can break some types of protection. However, I replied that I found an article saying that PyPDF2 can break encryption made with Adobe Acrobat Pro 6.0, but not with later versions.

I could not reproduce these issues with `PyPDF2`, everything works just fine. I used `pdftk` as well as online services to encrypt files. Can you post links to "troublesome" pdf files? — constt, Oct 07 '19 at 05:47
@constt The case that you mention is when the password is known? I am not 100% sure, but it seems that when you encrypt a pdf you need to enter a password. My guess is that pdftk is an old encryption system. The last version of Adobe Acrobat Pro produces an improved encryption. I have a pdf sample, but I do not see how to add it to the question. Also, this website does not allow links that go to a different website. The only solution that I see is to download a trial of Acrobat Pro, and encrypt a pdf. However, do not use the password. Sorry if my answer is not helpful. — Beginner, Oct 07 '19 at 23:29
@constt I found on GitHub an article, "PyPDF2 can't decrypt PDF files with Acrobat 6.0 or higher password security compatibility." I guess that is the answer to your question. — Beginner, Oct 08 '19 at 01:53
OK, thanks! Have you tried to use `qpdf` to decrypt your files? In the case it will do the trick, you can call it from your script using `subprocess` module to decrypt files before parsing them. — constt, Oct 08 '19 at 03:46
@constt I read in the pikepdf documentation that pikepdf is based on qpdf, so most likely will be the same situation. Additionally, as we are beginners in programming, we have no idea how to use qpdf without pikepdf. I heard that we would need to use Linux. We are only Windows/Python users. — Beginner, Oct 08 '19 at 17:19
First, PyPDF2 cannot decrypt Acrobat PDF files >= 6.0. Second, pikepdf currently does not have text extraction implemented. — Life is complex, Oct 09 '19 at 00:40
@Lifeiscomplex Thank you for your help! My doubt is why, if the PDF is decrypted with pikepdf, the other Python libraries that support text extraction cannot read it. — Beginner, Oct 09 '19 at 01:22
@Beginner I would speculate that this has to do with the underlying formatting being used by pikepdf to write the unencrypted PDF. — Life is complex, Oct 09 '19 at 02:29
*"Do all secured/encrypted pdf are password protected?"* - no. There also are pdfs encrypted using private/public key cryptography based on X509 certificates. — mkl, Oct 09 '19 at 05:14
ref: UPDATE 10/7/2019 -- Can you please provide more technical details on how you are bypassing the secure PDF features with Adobe Reader? This should not be possible unless the file has no protection enabled. — Life is complex, Oct 09 '19 at 17:59
@Lifeiscomplex As it is explained in my UPDATE 10/7/2019, using a combination of Adobe Reader and the printer Microsoft to PDF, a copy is created. That copy is a PDF that never has been encrypted, and for that reason works with almost all the Python libraries. I tested with Camelot, and Tabula. It is not a scientific method. Most likely in the future will not work. Maybe it is a bug in the encryption program. That is why I am still looking for a different method. However, it seems that your opinion would be that the method does not exist. — Beginner, Oct 09 '19 at 18:37
Just to answer one of your questions, you can indeed encrypt a PDF file without requiring a password. This is the case of PDF files that have feature restrictions (no printing, no resave, no edition, etc) — yms, Oct 12 '19 at 12:29
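
constt's suggestion in the comments above, calling `qpdf` from a script through `subprocess`, could look roughly like the sketch below. It assumes the `qpdf` executable is installed and on the PATH (qpdf also publishes Windows builds, so Linux should not be required). The function names are hypothetical; `--decrypt` and `--password=` are qpdf command-line options.

```python
import shutil
import subprocess


def build_qpdf_command(src, dst, password=None):
    """Return the argument list for a `qpdf --decrypt` call."""
    cmd = ["qpdf", "--decrypt"]
    if password is not None:
        cmd.append(f"--password={password}")
    cmd += [src, dst]
    return cmd


def decrypt_with_qpdf(src, dst, password=None):
    """Run qpdf and return True when it wrote the decrypted copy."""
    if shutil.which("qpdf") is None:
        raise FileNotFoundError("qpdf executable not found on PATH")
    result = subprocess.run(
        build_qpdf_command(src, dst, password),
        capture_output=True,
        text=True,
    )
    # qpdf exits with 0 on success and 3 when it succeeded with warnings.
    return result.returncode in (0, 3)


# Only shows the command; nothing is executed here.
print(build_qpdf_command("encrypted.pdf", "decrypted.pdf"))
```

Since pikepdf is built on qpdf, the result is expected to match what pikepdf produces; the sketch mainly shows that no extra Python library is needed to try it.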

Life is complex · Accepted Answer · 2019-10-11T15:21:40.683

LAST UPDATED 10-11-2019

I'm unsure if I understand your question completely. The code below can be refined, but it reads in either an encrypted or unencrypted PDF and extracts the text. Please let me know if I misunderstood your requirements.

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def extract_encrypted_pdf_text(path, encryption_true, decryption_password):

  output = StringIO()

  resource_manager = PDFResourceManager()
  laparams = LAParams()

  device = TextConverter(resource_manager, output, codec='utf-8', laparams=laparams)

  pdf_infile = open(path, 'rb')
  interpreter = PDFPageInterpreter(resource_manager, device)

  page_numbers = set()

  if encryption_true == False:
    for page in PDFPage.get_pages(pdf_infile, page_numbers, maxpages=0, caching=True, check_extractable=True):
      interpreter.process_page(page)

  elif encryption_true == True:
    for page in PDFPage.get_pages(pdf_infile, page_numbers, maxpages=0, password=decryption_password, caching=True, check_extractable=True):
      interpreter.process_page(page)

  text = output.getvalue()
  pdf_infile.close()
  device.close()
  output.close()
  return text

results = extract_encrypted_pdf_text('encrypted.pdf', True, 'password')
print (results)

I noted that your pikepdf code used to open an encrypted PDF was missing a password, which should have thrown this error message:

pikepdf._qpdf.PasswordError: encrypted.pdf: invalid password

import pikepdf

with pikepdf.open("encrypted.pdf", password='password') as pdf:
  num_pages = len(pdf.pages)
  del pdf.pages[-1]
  pdf.save("decrypted.pdf")

You can use tika to extract the text from the decrypted.pdf created by pikepdf.

from tika import parser

parsedPDF = parser.from_file("decrypted.pdf")
pdf = parsedPDF["content"]
pdf = pdf.replace('\n\n', '\n')
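
One caveat with the snippet above, stated as an assumption about tika's behavior and not something tested in this answer: when no text can be extracted, the "content" entry may be None, and calling `.replace` on it would fail. A small helper (hypothetical name `clean_extracted_text`) can guard against that and tidy the blank lines at the same time:

```python
import re


def clean_extracted_text(content):
    """Normalize the "content" value taken from tika's parser output."""
    if content is None:  # nothing was extracted
        return ""
    # Collapse runs of newlines into a single newline and trim the ends.
    return re.sub(r"\n{2,}", "\n", content).strip()


print(repr(clean_extracted_text("\n\n\nName: Ada\n\n\nAmount: 1200\n")))
print(repr(clean_extracted_text(None)))
```

It would be called as `clean_extracted_text(parsedPDF["content"])` in place of the two `pdf = ...` lines.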

Additionally, pikepdf does not currently implement text extraction; this includes the latest release, v1.6.4.

I decided to run a couple of tests using various encrypted PDF files.

I named all the encrypted files 'encrypted.pdf' and they all used the same encryption and decryption password.

Adobe Acrobat 9.0 and later - encryption level 256-bit AES
- pikepdf was able to decrypt this file
- PyPDF2 could not extract the text correctly
- tika could extract the text correctly
Adobe Acrobat 6.0 and later - encryption level 128-bit RC4
- pikepdf was able to decrypt this file
- PyPDF2 could not extract the text correctly
- tika could extract the text correctly
Adobe Acrobat 3.0 and later - encryption level 40-bit RC4
- pikepdf was able to decrypt this file
- PyPDF2 could not extract the text correctly
- tika could extract the text correctly
Adobe Acrobat 5.0 and later - encryption level 128-bit RC4
- created with Microsoft Word
- pikepdf was able to decrypt this file
- PyPDF2 could extract the text correctly
- tika could extract the text correctly
Adobe Acrobat 9.0 and later - encryption level 256-bit AES
- created using pdfprotectfree
- pikepdf was able to decrypt this file
- PyPDF2 could extract the text correctly
- tika could extract the text correctly

PyPDF2 was able to extract text from decrypted PDF files not created with Adobe Acrobat.

I would assume that the failures have something to do with embedded formatting in the PDFs created by Adobe Acrobat. More testing is required to confirm this conjecture about the formatting.

tika was able to extract text from all the documents decrypted with pikepdf.

import pikepdf
with pikepdf.open("encrypted.pdf", password='password') as pdf:
    num_pages = len(pdf.pages)
    del pdf.pages[-1]
    pdf.save("decrypted.pdf")


from PyPDF2 import PdfFileReader

def text_extractor(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        # pages are zero-indexed, so this is the second page
        page = pdf.getPage(1)
        print('Page type: {}'.format(str(type(page))))
        text = page.extractText()
        print(text)

text_extractor('decrypted.pdf')

PyPDF2 cannot decrypt Acrobat PDF files >= 6.0

This issue has been open with the module owners since September 15, 2015. It is unclear from the comments on this issue when the project owners will fix the problem. The last commit was June 25, 2018.

PyPDF4 decryption issues

PyPDF4 is the replacement for PyPDF2. This module also has decryption issues with certain algorithms used to encrypt PDF files.

test file: Adobe Acrobat 9.0 and later - encryption level 256-bit AES

PyPDF2 error message: only algorithm code 1 and 2 are supported

PyPDF4 error message: only algorithm code 1 and 2 are supported. This PDF uses code 5
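
To read these messages, it helps to know what the codes stand for. Assuming the "algorithm code" is the `/V` entry of the PDF encryption dictionary (which is how I read the error, not something either project's message states), the values map roughly as follows. The dictionary and function names are only illustrative.

```python
# Meaning of the /V entry in a PDF encryption dictionary (standard handler).
ENCRYPTION_ALGORITHMS = {
    1: "RC4, 40-bit key",
    2: "RC4, key length from 40 up to 128 bits",
    4: "crypt filters: RC4 or AES, 128-bit key",
    5: "AES, 256-bit key",
}


def describe_algorithm(v_code):
    """Return a short description for an algorithm code."""
    return ENCRYPTION_ALGORITHMS.get(v_code, "unknown or unsupported")


for code in (1, 2, 4, 5):
    print(code, "->", describe_algorithm(code))
```

Under that reading, "code 5" in the PyPDF4 message matches the 256-bit AES test file, and a library limited to codes 1 and 2 can only handle the RC4-based schemes.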

UPDATE SECTION 10-11-2019

This section is in response to your updates on 10-07-2019 and 10-08-2019.

In your update you stated that you could open a 'secured pdf with Adobe Reader' and print the document to another PDF, which removes the 'SECURED' flag. After doing some testing, I believe I have figured out what is occurring in this scenario.

Adobe PDFs level of security

Adobe PDFs have multiple types of security controls that can be enabled by the owner of the document. The controls can be enforced with either a password or a certificate.

Document encryption (enforced with a document open password)
- Encrypt all document contents (most common)
- Encrypt all document contents except metadata (Acrobat 6.0 and later)
- Encrypt only file attachments (Acrobat 7.0 and later)
Restrictive editing and printing (enforced with a permissions password)
- Printing Allowed
- Changes Allowed

The image below shows an Adobe PDF being encrypted with 256-bit AES encryption. A password is required to open or print this PDF. When you open this document in Adobe Reader with the password, the title will state SECURED.

This document requires a password to open with the Python modules that are mentioned in this answer. If you attempt to open an encrypted PDF with Adobe Reader, you should see this:

If you don't get this warning, then the document either has no security controls enabled or only has the restrictive editing and printing ones enabled.

The image below shows restrictive editing being enabled with a password in a PDF document. Note that printing is enabled. A password is not required to open or print this PDF. When you open this document in Adobe Reader without a password, the title will state SECURED. This is the same warning as for the encrypted PDF that was opened with a password.

When you print this document to a new PDF the SECURED warning is removed, because the restrictive editing has been removed.

All Adobe products enforce the restrictions set by the permissions password. However, if third-party products do not support these settings, document recipients are able to bypass some or all of the restrictions set.

So I assume that the document that you are printing to PDF has restrictive editing enabled and does not have a password required to open enabled.
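
For background, the editing and printing restrictions are recorded in the PDF as a bit field (the `/P` entry of the encryption dictionary), which is why software that ignores those bits can bypass them. The sketch below decodes such a value using the bit positions from the PDF specification; the short labels and the function name are my own, and the sample values are illustrative.

```python
# Bit positions (1-based, as numbered in the PDF specification).
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "extract": 5,
    "annotate": 6,
    "fill_forms": 9,
    "extract_for_accessibility": 10,
    "assemble": 11,
    "print_high_quality": 12,
}


def decode_permissions(p_value):
    """Decode a signed 32-bit /P value into {permission: allowed}."""
    unsigned = p_value & 0xFFFFFFFF  # view the signed value as 32 raw bits
    return {
        name: bool((unsigned >> (bit - 1)) & 1)
        for name, bit in PERMISSION_BITS.items()
    }


print(decode_permissions(-3904))  # every listed permission denied
print(decode_permissions(-3900))  # only low-quality printing allowed
```

A document like the one described above, with printing allowed but editing restricted, would have the print bit set and the modify bit cleared.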

Concerning breaking PDF encryption

Neither PyPDF2 nor PyPDF4 is designed to break the document open password function of a PDF document. Both modules will throw the following error if they attempt to open an encrypted, password-protected PDF file.

PyPDF2.utils.PdfReadError: file has not been decrypted

The opening password function of an encrypted PDF file can be bypassed using a variety of methods, but a single technique might not work and some will not be acceptable because of several factors, including password complexity.

PDF encryption internally works with encryption keys of 40, 128, or 256 bit depending on the PDF version. The binary encryption key is derived from a password provided by the user. The password is subject to length and encoding constraints.

For example, PDF 1.7 Adobe Extension Level 3 (Acrobat 9 - AES-256) introduced Unicode characters (65,536 possible characters) and bumped the maximum length to 127 bytes in the UTF-8 representation of the password.
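
To put those key sizes and the password length limit in perspective, here is a small arithmetic sketch. It only restates the numbers given above (40, 128, and 256 bits, and 127 bytes of UTF-8); the function names are hypothetical.

```python
def keyspace_size(bits):
    """Number of distinct keys for a key of the given length."""
    return 2 ** bits


def fits_password_limit(password, max_bytes=127):
    """Check a password against the 127-byte UTF-8 limit mentioned above."""
    return len(password.encode("utf-8")) <= max_bytes


for bits in (40, 128, 256):
    print(f"{bits}-bit key: {keyspace_size(bits):.3e} possible keys")

# Non-ASCII characters take more than one byte in UTF-8,
# so the limit in characters can be lower than 127.
print(fits_password_limit("é" * 63))  # 126 bytes
print(fits_password_limit("é" * 64))  # 128 bytes
```

The jump from 40 to 128 bits is what separates a key space that can be searched exhaustively from one that cannot, which is why attacks on newer files target the password and not the key.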

The code below will open a PDF with restrictive editing enabled. It will save this file to a new PDF without the SECURED warning being added. The tika code will parse the contents from the new file.

from tika import parser
import pikepdf

# opens a PDF with restrictive editing enabled, but that still 
# allows printing.
with pikepdf.open("restrictive_editing_enabled.pdf") as pdf:
  pdf.save("restrictive_editing_removed.pdf")

# plain text output
parsedPDF = parser.from_file("restrictive_editing_removed.pdf")

# XHTML output
# parsedPDF = parser.from_file("restrictive_editing_removed.pdf", xmlContent=True)

text = parsedPDF["content"]
text = text.replace('\n\n', '\n')
print (text)

This code checks if a password is required for opening the file. This code can be refined and other functions can be added. There are several other features that could be added, but the documentation for pikepdf does not match the comments within the code base, so more research is required to improve this.

# this would be removed once logging is used
############################################
import sys
sys.tracebacklimit = 0
############################################

import pikepdf
from tika import parser

def create_pdf_copy(pdf_file_name):
  with pikepdf.open(pdf_file_name) as pdf:
    new_filename = f'copy_{pdf_file_name}'
    pdf.save(new_filename)
    return  new_filename

def extract_pdf_content(pdf_file_name):
  # plain text output
  # parsedPDF = parser.from_file("restrictive_editing_removed.pdf")

  # XHTML output
  parsedPDF = parser.from_file(pdf_file_name, xmlContent=True)

  pdf = parsedPDF["content"]
  pdf = pdf.replace('\n\n', '\n')
  return pdf

def password_required(pdf_file_name):
  # Returns None when the file opens without a password,
  # otherwise a short message describing the problem.
  try:
    with pikepdf.open(pdf_file_name):
      return None

  except pikepdf.PasswordError:
    return 'password required'

  except pikepdf.PdfError:
    return 'cannot open file'


filename = 'decrypted.pdf'
password = password_required(filename)
if password is not None:
  print (password)
else:
  pdf_file = create_pdf_copy(filename)
  results = extract_pdf_content(pdf_file)
  print (results)

First, please accept my apologies for the delay. I had lost hope that somebody would answer. Thank you for your help! It looks great! However, it seems that your code requires a password. It is something that we do not have. For that reason, my code does not have a password. My UPDATE 10/7/2019 shows a way to get the data without a password. However, it looks like an artisan method. "My method" does not solve my problem at 100%. — Beginner, Oct 09 '19 at 16:54
How are you opening a secure PDF file without providing a password? — Life is complex, Oct 09 '19 at 17:40
I did what is explained in my UPDATE 10/7/2019. I create a copy with that method. After that, almost all libraries work. The copy is a document that never has been encrypted. — Beginner, Oct 09 '19 at 18:31
The material that you added is fantastic! Thank you so much! The question that remains is how can we extract the data of this type of PDFs ONLY with Python? Does that method exist? — Beginner, Oct 10 '19 at 01:49
Answer updated with code that worked with a PDF that had restrictive editing protection enabled, but allowed printing. — Life is complex, Oct 10 '19 at 02:14
That looks like a big improvement! Sorry, to abuse of your brilliant brain. What if we need to get JSON, SQLite, CSV, xlsx format? Camelot or Tabula can do that with normal PDFs.. How would you do that? Is it possible? — Beginner, Oct 10 '19 at 02:56
It would be nice to have that format. However, the other formats are the ones that we are looking for. With the trick we can do all that. However, we need something similar but only with Python. The problem with the trick is that we cannot automate the process. — Beginner, Oct 10 '19 at 03:30
I modified the answer to output XHTML. JSON is possible, but it requires digging into the github project code related the tika parser. — Life is complex, Oct 10 '19 at 03:37
Thank for the update. At the end, we need to know what is possible. — Beginner, Oct 10 '19 at 19:07

Mahendra Singh · Answer 2 · 2019-10-10T06:09:04.997

You can try to handle the error these files produce when you open these files without a password.

import pikepdf

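def open_pdf(pdf_file_path, pdf_password=''):
    try:
        # first try without any password argument
        return pikepdf.Pdf.open(pdf_file_path)

    except pikepdf.PasswordError:
        # retry with the supplied password (an empty string by default);
        # a second PasswordError propagates to the caller
        return pikepdf.Pdf.open(pdf_file_path, password=pdf_password)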

You can use the returned pdf_obj for your parsing work. Also, you can provide the password in case you have an encrypted PDF.

edited Oct 10 '19 at 06:09

answered Oct 10 '19 at 03:47

Mahendra Singh

Thank you for your answer! We are trying to read it without a password. At this time, we were able to do it with the method that was explained in my UPDATE 10/7/2019 – Beginner Oct 10 '19 at 16:54
This is far from answering the question. Seems like you haven't read the complete question. – shoonya ek Oct 11 '19 at 06:45
This handles those secured PDFs where normally pikepdf fails when default value of the password is None. By passing an empty string it is able to open and parse a secured PDF document properly (in the test cases that I ran). – Mahendra Singh Oct 11 '19 at 09:13
@Beginner u don't have to convert the PDFs here in this case. This is just from my prior experience that secured PDFs work by providing an empty password. – Mahendra Singh Oct 11 '19 at 09:15
I tried to run your code, but it is not working. Can you publish your entire code? I am a beginner. I am talking about the code that is working in your computer. Can you please show what kind of formats you can produce. – Beginner Oct 11 '19 at 18:43
@Beginner this is my entire code. This only returns the pdf_object from pikepdf. In case you want to save this pdf, just save the returned object by using pdf_obj.save('your_file_path'). After this, you can use this PDF to parse text and other objects. I use a library called [PdfPlumber](https://github.com/jsvine/pdfplumber) for text extraction. – Mahendra Singh Oct 15 '19 at 07:15

score 1 · Answer 3 · answered Nov 24 '19 at 00:34

For tabula-py, you can try password option with read_pdf. It depends on tabula-java's function so I'm not sure which encryption is supported though.
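
A minimal sketch of that suggestion, assuming tabula-py and Java are installed. `read_pdf` accepts a `password` argument; the helper names `tabula_options` and `read_tables` are hypothetical, and whether a given encryption scheme is supported still depends on tabula-java.

```python
def tabula_options(password=None, pages="all"):
    """Build keyword arguments for tabula.read_pdf."""
    options = {"pages": pages, "stream": True}
    if password is not None:
        options["password"] = password
    return options


def read_tables(path, password=None):
    """Read tables from a PDF, passing the password through when given."""
    import tabula  # imported here so the helper above works without tabula-py

    return tabula.read_pdf(path, **tabula_options(password))


print(tabula_options("secret"))
```

For a file that opens without a password, calling `read_tables("decrypted.pdf")` leaves the `password` argument out entirely.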

chezou
