My project is to automatically fill in the PDF form that the German railway company (Deutsche Bahn) provides for delayed trains: https://www.bahn.de/wmedia/view/mdb/media/intern/fahrgastrechteformular.pdf
When you open the link in Google Chrome, you can easily edit the document, so it should also be possible to do this in Python.
I tried multiple things:
1. Using PyPDF2
with the approach suggested in the second answer to this Stack Overflow question: Batch fill PDF forms from python or bash
# -*- coding: utf-8 -*-
from collections import OrderedDict
from PyPDF2 import PdfFileWriter, PdfFileReader


def _getFields(obj, tree=None, retval=None, fileobj=None):
    """
    Extracts field data if this PDF contains interactive form fields.
    The *tree* and *retval* parameters are for recursive use.

    :param fileobj: A file object (usually a text file) to write
        a report to on all interactive form fields found.
    :return: A dictionary where each key is a field name, and each
        value is a :class:`Field<PyPDF2.generic.Field>` object. By
        default, the mapping name is used for keys.
    :rtype: dict, or ``None`` if form data could not be located.
    """
    fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent', '/T': 'Field Name', '/TU': 'Alternate Field Name',
                       '/TM': 'Mapping Name', '/Ff': 'Field Flags', '/V': 'Value', '/DV': 'Default Value'}
    if retval is None:
        retval = OrderedDict()
        catalog = obj.trailer["/Root"]
        # get the AcroForm tree
        if "/AcroForm" in catalog:
            tree = catalog["/AcroForm"]
        else:
            return None
    if tree is None:
        return retval

    obj._checkKids(tree, retval, fileobj)
    for attr in fieldAttributes:
        if attr in tree:
            # Tree is a field
            obj._buildField(tree, retval, fileobj, fieldAttributes)
            break

    if "/Fields" in tree:
        fields = tree["/Fields"]
        for f in fields:
            field = f.getObject()
            obj._buildField(field, retval, fileobj, fieldAttributes)

    return retval


def get_form_fields(infile):
    infile = PdfFileReader(open(infile, 'rb'))
    fields = _getFields(infile)
    return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())


if __name__ == '__main__':
    from pprint import pprint

    pdf_file_name = '2PagesFormExample.pdf'
    pprint(get_form_fields(pdf_file_name))
However, the program fails because the PDF has not been decrypted:
  File "c:\Users\User1\iCloudDrive\fahrgastrechte\fahrgastrechte.py", line 94, in <module>
    pprint(get_form_fields(pdf_file_name))
  File "c:\Users\User1\iCloudDrive\fahrgastrechte\fahrgastrechte.py", line 62, in get_form_fields
    fields = _getFields(infile)
  File "c:\Users\User1\iCloudDrive\fahrgastrechte\fahrgastrechte.py", line 32, in _getFields
    catalog = obj.trailer["/Root"]
  File "C:\Program Files\Python36\lib\site-packages\PyPDF2\generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "C:\Program Files\Python36\lib\site-packages\PyPDF2\generic.py", line 178, in getObject
    return self.pdf.getObject(self).getObject()
  File "C:\Program Files\Python36\lib\site-packages\PyPDF2\pdf.py", line 1617, in getObject
    raise utils.PdfReadError("file has not been decrypted")
PyPDF2.utils.PdfReadError: file has not been decrypted
I don't know why decryption is necessary at all, because for now I only want to read data. I could understand it if this were about writing data. However, Google Chrome, for example, can also write to the fields of this PDF.
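One possible explanation, stated as an assumption about this particular file rather than something verified: in an encrypted PDF the strings and streams themselves are encrypted, so even reading field values requires the key. If only an owner password is set, the user password is empty, and a viewer can derive the key without prompting. The following is a minimal sketch of trying that empty password with the PyPDF2 1.x API (`isEncrypted`, `decrypt`, `getFields`). The helper names `unlock` and `read_form_values` are hypothetical, and the sketch may still fail if the file uses an encryption scheme that this PyPDF2 version does not implement.

```python
def unlock(reader, passwords=("",)):
    """Try each candidate password on a possibly encrypted reader.

    `reader` is expected to behave like PyPDF2's PdfFileReader, i.e. to
    offer an `isEncrypted` attribute and a `decrypt(password)` method.
    Returns True if the reader should be readable afterwards.
    """
    if not reader.isEncrypted:
        return True
    for password in passwords:
        try:
            # PyPDF2 1.x returns 0 on failure and a non-zero code on success.
            if reader.decrypt(password):
                return True
        except NotImplementedError:
            # Raised by PyPDF2 1.x for encryption schemes it does not support.
            return False
    return False


def read_form_values(path):
    """Return {field name: value} for the form in `path`, or None if locked."""
    from PyPDF2 import PdfFileReader  # imported lazily; PyPDF2 1.x API

    with open(path, "rb") as handle:
        reader = PdfFileReader(handle)
        if not unlock(reader):
            return None
        fields = reader.getFields() or {}
        return {name: field.get("/V", "") for name, field in fields.items()}


# Example call (requires PyPDF2 and a local copy of the form):
# print(read_form_values("fahrgastrechteformular.pdf"))
```

If `read_form_values` returns `None`, the empty password was not accepted or the scheme is unsupported, and the file would have to be decrypted with another tool first.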
2. Using pypdftk.
At first I just wanted to read the form data:
import pypdftk
pdf_file_name = './fahrgastrechteformular.pdf'
data = pypdftk.dump_data_fields(pdf_file_name)
Currently my system (Windows 10) does not find the pdftk.exe that the Python module calls, so I called it directly in bash:
pdftk.exe fahrgastrechteformular.pdf dum_data_fields
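I also got an encryption error back. (The second "Failed to open PDF file" error comes from my misspelling of dump_data_fields, which pdftk treated as a file name; the owner password error is independent of that.)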
Error: Failed to open PDF file:
fahrgastrechteformular.pdf
OWNER PASSWORD REQUIRED, but not given (or incorrect)
Error: Unable to find file.
Error: Failed to open PDF file:
dum_data_fields
Done. Input errors, so no output created.
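Once pdftk can open the file (for example after the owner password issue is resolved, or on a decrypted copy), the text printed by dump_data_fields can be turned into a dictionary. The sketch below assumes pdftk's usual output layout: records separated by `---` lines, each containing `Key: value` lines such as `FieldName` and `FieldValue`. The helper names are hypothetical, and `pdftk` is assumed to be on the PATH.

```python
import subprocess


def parse_data_fields(dump):
    """Parse the text printed by `pdftk <file> dump_data_fields` into a list of dicts."""
    records = []
    current = None
    for line in dump.splitlines():
        line = line.strip()
        if line == "---":
            # A line of three dashes starts a new field record.
            current = {}
            records.append(current)
            continue
        if current is None or ":" not in line:
            continue
        key, _, value = line.partition(":")
        # Keys such as FieldStateOption can repeat, so values are kept in lists.
        current.setdefault(key.strip(), []).append(value.strip())
    return records


def field_values(records):
    """Map each field name to its first value ('' if the field is empty)."""
    return {
        record["FieldName"][0]: (record.get("FieldValue") or [""])[0]
        for record in records
        if "FieldName" in record
    }


def dump_data_fields(pdf_path, pdftk="pdftk"):
    """Run pdftk and return the parsed field records (raises if pdftk fails)."""
    result = subprocess.run(
        [pdftk, pdf_path, "dump_data_fields"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_data_fields(result.stdout)
```

Calling `field_values(dump_data_fields("fahrgastrechteformular.pdf"))` would then give a name-to-value mapping, provided pdftk itself succeeds on the file.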
To summarize: as a first step I just want to read the form fields of the PDF. For example, after filling in the first field with "Berlin Central Station" in Google Chrome, I want to read that value back with the Python scripts shown above. The next step would be to actually edit the fields' contents. I hope this is understandable; please ask if anything is unclear.