18

I have a PDF form that needs to be filled out many times (a timesheet, to be exact). Since I don't want to do this by hand, I'm looking for a way to fill the forms out with a Python script, or with tools that could be called from a bash script.

Does anyone have experience with this?

McEnroe

4 Answers

17

For Python, you'll need the fdfgen library and pdftk.

@Hugh Bothwell's comment is 100% correct so I'll extend that answer with a working implementation.

If you're on Windows, you'll also need to make sure both Python and pdftk are on the system PATH (unless you want to type out their full folder paths).
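A quick way to confirm the PATH setup before running anything else is to ask Python where it finds each tool. This is only a sketch: `find_tool` is a helper name made up for this example, and it relies on nothing beyond the standard library's `shutil.which`.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    # Report where each required tool lives, or flag it as missing.
    for tool in ("python", "pdftk"):
        location = find_tool(tool)
        print("{0}: {1}".format(tool, location or "NOT FOUND on PATH"))
```

If pdftk is reported as missing, add its install folder to PATH and open a new terminal before trying again.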

Here's the code to auto-batch-fill a collection of PDF forms from a CSV data file:

import csv
import os

from fdfgen import forge_fdf

# Configuration: change these to match your own files.
filename_prefix = "NVC"
csv_file = "NVC.csv"
pdf_file = "NVC.pdf"
tmp_file = "tmp.fdf"
output_folder = './output/'

def process_csv(file):
    """Return a list of rows, each a list of (field_name, value) tuples."""
    headers = []
    data = []
    with open(file) as f:
        for i, row in enumerate(csv.reader(f)):
            if i == 0:
                # The first row holds the PDF field names.
                headers = row
                continue
            field = []
            for j in range(len(headers)):
                field.append((headers[j], row[j]))
            data.append(field)
    return data

def form_fill(fields):
    """Write the fields to a temporary FDF file and merge it into the PDF with pdftk."""
    fdf = forge_fdf("", fields, [], [], [])
    # Binary mode: forge_fdf returns bytes under Python 3.
    with open(tmp_file, "wb") as fdf_file:
        fdf_file.write(fdf)
    # The output file is named after the value in the second CSV column.
    output_file = '{0}{1} {2}.pdf'.format(output_folder, filename_prefix, fields[1][1])
    cmd = 'pdftk "{0}" fill_form "{1}" output "{2}" dont_ask'.format(pdf_file, tmp_file, output_file)
    os.system(cmd)
    os.remove(tmp_file)

data = process_csv(csv_file)
print('Generating Forms:')
print('-----------------------')
for row in data:
    # Skip rows whose first column is 'Yes'.
    if row[0][1] == 'Yes':
        continue
    print('{0} {1} created...'.format(filename_prefix, row[1][1]))
    form_fill(row)

Note: It shouldn't be rocket-surgery to figure out how to customize this. The initial variable declarations contain the custom configuration.
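One possible customization, sketched here under assumptions rather than taken from the script: calling pdftk through `subprocess.run` with an argument list avoids manual quoting of file names that contain spaces, and `check=True` turns a failed pdftk run into an exception. The helper names `build_pdftk_command` and `run_pdftk` are hypothetical; `fill_form`, `output`, `dont_ask`, and `flatten` are regular pdftk keywords.

```python
import subprocess


def build_pdftk_command(pdf_file, fdf_file, output_file, flatten=False):
    """Build the pdftk argument list for filling a form from an FDF file."""
    cmd = ["pdftk", pdf_file, "fill_form", fdf_file,
           "output", output_file, "dont_ask"]
    if flatten:
        # Bake the values in so the fields are no longer editable.
        cmd.append("flatten")
    return cmd


def run_pdftk(pdf_file, fdf_file, output_file, flatten=False):
    """Run pdftk; raises subprocess.CalledProcessError on a non-zero exit."""
    subprocess.run(
        build_pdftk_command(pdf_file, fdf_file, output_file, flatten),
        check=True,
    )
```

Because each argument is passed separately, a name such as "NVC John Smith.pdf" needs no surrounding quotes.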

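In the CSV, each column of the first row holds the name of the corresponding field in the PDF file. Any columns that don't have corresponding fields in the template will be ignored. As written, the script also skips any row whose first column is 'Yes' and uses the second column's value in the output file name.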
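To make the CSV layout concrete, here is a small sketch of how a header row and data rows turn into (field name, value) pairs. The column names `Skip`, `Name`, and `Hours` and the sample rows are invented for illustration; a real file would use the field names from your own PDF.

```python
import csv
import io

# Hypothetical CSV content: the first row names the PDF fields.
sample_csv = io.StringIO(
    "Skip,Name,Hours\n"
    "No,Alice,40\n"
    "Yes,Bob,0\n"
)

reader = csv.reader(sample_csv)
headers = next(reader)
# Pair every value with the header of its column.
rows = [list(zip(headers, row)) for row in reader]

# Mirror the script's rule: rows whose first column is 'Yes' are skipped.
to_fill = [row for row in rows if row[0][1] != "Yes"]

for fields in to_fill:
    # The second column's value is what the script puts in the file name.
    print(fields[1][1], fields)
```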

In the PDF template, just create editable fields where you want your data to fill and make sure the names match up with the CSV data.
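To check which field names a template actually exposes, pdftk can print them with its `dump_data_fields` operation. The sketch below parses that text output; `parse_field_names` and `list_pdf_fields` are names made up for this example, and the sample text is a hand-written imitation of pdftk's output, not a real dump.

```python
import subprocess


def parse_field_names(dump_text):
    """Extract field names from the text printed by `pdftk ... dump_data_fields`."""
    names = []
    for line in dump_text.splitlines():
        if line.startswith("FieldName:"):
            names.append(line.split(":", 1)[1].strip())
    return names


def list_pdf_fields(pdf_file):
    """Ask pdftk for the form fields of pdf_file (requires pdftk on PATH)."""
    result = subprocess.run(
        ["pdftk", pdf_file, "dump_data_fields"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_field_names(result.stdout)


# Hand-written sample in the shape of pdftk's output.
sample_dump = """---
FieldType: Text
FieldName: Name
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: Hours
FieldFlags: 0
FieldJustification: Left
"""

print(parse_field_names(sample_dump))
```

Comparing this list with the CSV header row is a quick way to spot a misspelled column name.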

For this specific configuration, just put this file in the same folder as your NVC.csv, NVC.pdf, and a folder named 'output'. Run it and it automagically does the rest.
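The script writes into './output/' but never creates that folder. If you would rather not create it by hand, a sketch like the following could run before the loop; `ensure_output_folder` is a hypothetical helper, not part of the script above.

```python
import os


def ensure_output_folder(path):
    """Create the folder if it is missing and return its absolute path."""
    os.makedirs(path, exist_ok=True)
    return os.path.abspath(path)
```

Calling `ensure_output_folder('./output/')` is harmless when the folder already exists.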

Evan Plaice
  • 1
    This works beautifully. Only thing I had to add was the path to PDFtk: `os.environ['PATH'] += os.pathsep + 'C:\\Program Files (x86)\\PDFtk\\bin;'` – Suzanne Aug 23 '17 at 22:00
  • 2
    I needed to replace `fdf_file = open(tmp_file,"w")` by `fdf_file = open(tmp_file,"wb")` to make it work. – edelans Jun 05 '20 at 16:34
  • The code runs, but I can't really see any data in the output PDF. Any ideas? – mandar munagekar Aug 12 '21 at 14:32
17

Much faster version, no pdftk nor fdfgen needed, pure Python 3.6+:

# -*- coding: utf-8 -*-

from collections import OrderedDict
from PyPDF2 import PdfFileWriter, PdfFileReader


def _getFields(obj, tree=None, retval=None, fileobj=None):
    """
    Extracts field data if this PDF contains interactive form fields.
    The *tree* and *retval* parameters are for recursive use.

    :param fileobj: A file object (usually a text file) to write
        a report to on all interactive form fields found.
    :return: A dictionary where each key is a field name, and each
        value is a :class:`Field<PyPDF2.generic.Field>` object. By
        default, the mapping name is used for keys.
    :rtype: dict, or ``None`` if form data could not be located.
    """
    fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent', '/T': 'Field Name', '/TU': 'Alternate Field Name',
                       '/TM': 'Mapping Name', '/Ff': 'Field Flags', '/V': 'Value', '/DV': 'Default Value'}
    if retval is None:
        retval = OrderedDict()
        catalog = obj.trailer["/Root"]
        # get the AcroForm tree
        if "/AcroForm" in catalog:
            tree = catalog["/AcroForm"]
        else:
            return None
    if tree is None:
        return retval

    obj._checkKids(tree, retval, fileobj)
    for attr in fieldAttributes:
        if attr in tree:
            # Tree is a field
            obj._buildField(tree, retval, fileobj, fieldAttributes)
            break

    if "/Fields" in tree:
        fields = tree["/Fields"]
        for f in fields:
            field = f.getObject()
            obj._buildField(field, retval, fileobj, fieldAttributes)

    return retval


def get_form_fields(infile):
    infile = PdfFileReader(open(infile, 'rb'))
    fields = _getFields(infile)
    return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())


def update_form_values(infile, outfile, newvals=None):
    pdf = PdfFileReader(open(infile, 'rb'))
    writer = PdfFileWriter()

    for i in range(pdf.getNumPages()):
        page = pdf.getPage(i)
        try:
            if newvals:
                writer.updatePageFormFieldValues(page, newvals)
            else:
                writer.updatePageFormFieldValues(page,
                                                 {k: f'#{i} {k}={v}'
                                                  for i, (k, v) in enumerate(get_form_fields(infile).items())
                                                  })
            writer.addPage(page)
        except Exception as e:
            print(repr(e))
            writer.addPage(page)

    with open(outfile, 'wb') as out:
        writer.write(out)


if __name__ == '__main__':
    from pprint import pprint

    pdf_file_name = '2PagesFormExample.pdf'

    pprint(get_form_fields(pdf_file_name))

    update_form_values(pdf_file_name, 'out-' + pdf_file_name)  # enumerate & fill the fields with their own names
    update_form_values(pdf_file_name, 'out2-' + pdf_file_name,
                       {'my_fieldname_1': 'My Value',
                        'my_fieldname_2': 'My Another Value'})  # update the form fields
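The dictionary comprehension in update_form_values uses an f-string, which needs Python 3.6 or later. For older interpreters, a minimal sketch of the same labelling with `str.format` might look like this; `label_fields` and the sample field names are made up for illustration.

```python
from collections import OrderedDict


def label_fields(fields):
    """Return {name: '#index name=value'} without using f-strings."""
    return {
        name: "#{0} {1}={2}".format(index, name, value)
        for index, (name, value) in enumerate(fields.items())
    }


# Hypothetical field names and values standing in for get_form_fields() output.
example = OrderedDict([("my_fieldname_1", ""), ("my_fieldname_2", "x")])
labels = label_fields(example)
print(labels)
```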
dvska
  • This shows a syntax error at {k: f'#{i} {k}={v}'. I'm using Python 3.5. Is that the reason? – Qaisar Rajput Feb 26 '18 at 11:53
  • 1
    f-strings require Python 3.6+. Workaround: `{k: "#{i} {k}={v}".format(**locals())}` – Rufflewind Mar 25 '18 at 07:13
  • Thank you so much for this. One tip for others reading: Open the original, make all the hardcoded changes you want, save it. (This allows easy editing of signatures and checkboxes.) Then only programmatically edit the fields you want edited. – Malcolm Crum Jul 16 '18 at 11:46
  • 2
    Unfortunately it seems that, after running this script and printing it out, the changes are not applied, though I do see them in Preview on my Mac. – Malcolm Crum Jul 23 '18 at 06:47
  • 1
    Copied exactly the same code and updated the source file name. It prints out all the fields but doesn't update anything in the output pdf file. Any suggestion? – Anoop Nair Apr 28 '19 at 16:08
  • 2
    If the filled values are hidden and only show up when you click on them in Acrobat, see discussion at: https://github.com/mstamy2/PyPDF2/issues/355 – Yifei H Nov 12 '19 at 17:01
  • If the filled values are hidden, you have not initialized the pdf writer object correctly. Try calling `pdf_writer.cloneReaderDocumentRoot(pdf_reader)` directly after creating the writer object. (This summarizes the issue #355 shared by @YifeiH above) – leezu Mar 01 '21 at 06:06
  • This works really well, but it doesn't handle checkboxes. Any ideas? – vy32 Apr 21 '21 at 19:19
0

Replace Original File

import os

# Fill the form into a new file, then swap it in place of the original.
os.system('pdftk "original.pdf" fill_form "data.fdf" output "output.pdf"')
os.remove("data.fdf")
os.remove("original.pdf")
os.rename("output.pdf", "original.pdf")
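On Python 3.3 and later, `os.replace` overwrites an existing destination on both Windows and POSIX systems, so the separate remove and rename steps could be collapsed into one call. The following is a sketch of that variation, not the method used above; `replace_original` is a hypothetical helper name.

```python
import os


def replace_original(original_path, filled_path):
    """Overwrite original_path with filled_path in a single step."""
    os.replace(filled_path, original_path)
```

With this approach there is no moment at which the original has been deleted but the new file has not yet been renamed.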
0

I wrote a library called fillpdf, built on pdfrw, pdf2image, Pillow, and PyPDF2. Install it with pip install fillpdf; the poppler dependency can be installed with conda install -c conda-forge poppler.

Basic usage:

from fillpdf import fillpdfs

fillpdfs.get_form_fields("blank.pdf")

# returns a dictionary of fields
# Set the returned dictionary's values and save it to a variable
# For radio boxes ('Off' = not filled, 'Yes' = filled)

data_dict = {
'Text2': 'Name',
'Text4': 'LastName',
'box': 'Yes',
}

fillpdfs.write_fillable_pdf('blank.pdf', 'new.pdf', data_dict)

# If you want it flattened:
fillpdfs.flatten_pdf('new.pdf', 'newflat.pdf')

More info here: https://github.com/t-houssian/fillpdf

If some fields don't fill, you can use fitz (pip install PyMuPDF) and PyPDF2 (pip install PyPDF2) as follows, altering the points as needed:

import fitz
from PyPDF2 import PdfFileReader

file_handle = fitz.open('blank.pdf')
pdf = PdfFileReader(open('blank.pdf','rb'))
box = pdf.getPage(0).mediaBox
w = box.getWidth()
h = box.getHeight()

# For images
image_rectangle = fitz.Rect((w/2)-200,h-255,(w/2)-100,h-118)
pages = pdf.getNumPages() - 1
last_page = file_handle[pages]
last_page._wrapContents()
last_page.insertImage(image_rectangle, filename='image.png')

# For text
last_page.insertText(fitz.Point((w/2)-247 , h-478), 'John Smith', fontsize=14, fontname="times-bold")
file_handle.save('newpdf.pdf')
Tyler Houssian