Search and replace for text within a pdf, in Python

Question

I am writing mailmerge software as part of a Python web app.

I have a template called letter.pdf which was generated from a MS Word file and includes the text {name} where the resident's name will go. I also have a list of c. 100 residents' names.

What I want to do is read in letter.pdf, search for "{name}", replace it with the resident's name (for each resident), and write the result to another PDF. I then want to gather all these PDFs into one big PDF (one page per letter), which my web app's users will print out to create their letters.

Are there any Python libraries that will do this? I've looked at pdfrw and pdfminer but I couldn't see where they would be able to do it.

(NB: I also have the MS Word file, so if there was another way of using that, and not going through a pdf, that would also do the job.)

Dmytro · Answer 1 · 2021-03-20T23:07:40.423

This can be done with the PyPDF2 package. The implementation may depend on the structure of the original PDF template. But if the template is stable and doesn't change very often, the replacement code doesn't need to be generic and can stay fairly simple.

Here is a small sketch of how you could replace text inside a PDF file. It replaces all occurrences of the token PDF with DOC.

import os
import argparse
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import DecodedStreamObject, EncodedStreamObject


def replace_text(content, replacements = dict()):
    lines = content.splitlines()

    result = ""
    in_text = False

    for line in lines:
        if line == "BT":
            in_text = True

        elif line == "ET":
            in_text = False

        elif in_text:
            cmd = line[-2:]
            if cmd.lower() == 'tj':
                replaced_line = line
                for k, v in replacements.items():
                    replaced_line = replaced_line.replace(k, v)
                result += replaced_line + "\n"
            else:
                result += line + "\n"
            continue

        result += line + "\n"

    return result


def process_data(object, replacements):
    data = object.getData()
    decoded_data = data.decode('utf-8')

    replaced_data = replace_text(decoded_data, replacements)

    encoded_data = replaced_data.encode('utf-8')
    if object.decodedSelf is not None:
        object.decodedSelf.setData(encoded_data)
    else:
        object.setData(encoded_data)


if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--input", required=True, help="path to PDF document")
    args = vars(ap.parse_args())

    in_file = args["input"]
    filename_base = in_file.replace(os.path.splitext(in_file)[1], "")

    # Provide replacements list that you need here
    replacements = { 'PDF': 'DOC'}

    pdf = PdfFileReader(in_file)
    writer = PdfFileWriter()

    for page_number in range(0, pdf.getNumPages()):

        page = pdf.getPage(page_number)
        contents = page.getContents()

        if isinstance(contents, DecodedStreamObject) or isinstance(contents, EncodedStreamObject):
            process_data(contents, replacements)
        elif len(contents) > 0:
            for obj in contents:
                if isinstance(obj, DecodedStreamObject) or isinstance(obj, EncodedStreamObject):
                    streamObj = obj.getObject()
                    process_data(streamObj, replacements)

        writer.addPage(page)

    with open(filename_base + ".result.pdf", 'wb') as out_file:
        writer.write(out_file)

UPDATE 2021-03-21:

Updated the code example to handle DecodedStreamObject and EncodedStreamObject, which actually contain the data stream with the text to update.
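
To make it easier to see what the replacement logic above operates on, here is a minimal sketch that runs the same idea against a hand-written content stream. The sample stream and the function name `replace_in_text_ops` are assumptions for illustration and do not come from a real PDF; real streams are often laid out differently (for example, several operators on one line), in which case this line-based approach will not match.

```python
def replace_in_text_ops(stream_text, replacements):
    """Apply replacements only to Tj/TJ lines that sit between BT and ET."""
    out = []
    in_text = False
    for line in stream_text.splitlines():
        if line == "BT":
            in_text = True
        elif line == "ET":
            in_text = False
        elif in_text and line[-2:].lower() == "tj":
            # Only text-showing operators inside a text object are rewritten
            for old, new in replacements.items():
                line = line.replace(old, new)
        out.append(line)
    return "".join(item + "\n" for item in out)


# Hand-written sample stream (hypothetical, for illustration only)
sample = "\n".join([
    "q",
    "(PDF outside a text object) Tj",
    "BT",
    "/F1 12 Tf",
    "72 720 Td",
    "(Dear PDF reader) Tj",
    "ET",
    "Q",
])

result = replace_in_text_ops(sample, {"PDF": "DOC"})
print(result)
```

Only the line inside the BT/ET pair is changed; the line outside the text object and the font and positioning operators are passed through untouched.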

This is working for sample file but I'm getting this error while working on a certificate. `data = object.getData() AttributeError: 'NameObject' object has no attribute 'getData'` any resolution to this? — Varad More, Oct 19 '20 at 16:33
Same Issue! `AttributeError: 'NameObject' object has no attribute 'getData'` — mattf, Nov 23 '20 at 07:10
This means that the PDF content stream structure is different. Could you provide a link to the sample PDF that you're dealing with, please? Then I could update the answer. – Dmytro, Nov 23 '20 at 17:39
for example this pdf downloaded from google docs. we.tl / t-pYzmky0R5B — swisswiss, Dec 11 '20 at 04:52
@Dmytro any solution please i am also getting the same issue `AttributeError: 'NameObject' object has no attribute 'getData'` — Hafiz Siddiq, Dec 20 '20 at 14:59
@swisswiss, Sorry for not answering earlier. Could you please share the pdf doc again, cause the link has expired. — Dmytro, Feb 03 '21 at 17:46
@Dmytro Looks like any basic PDF file generated by GhostScript generates the error: https://gofile.io/d/qxJKOK — mrgou, Mar 20 '21 at 09:22
@mrgou I updated the code example to handle the data streams. Not sure if it works with all kinds of PDFs but at least processes the PDF you provided. The idea is basically to find either `DecodedStreamObject` or `EncodedStreamObject` in the PDF pages and apply the replacement code to their contents. — Dmytro, Mar 20 '21 at 23:13
This solution doesn't work for PDFs created from Word. How do you create a simple PDF from a word doc that would be compliant? — alias51, Oct 12 '21 at 14:31
This only works when the text in a pdf is plaintext. For example, a PDF may have content like: `(A)-5.5 (BC OF)-5.5 ( ALPHA)7.4 (B)-5.5 (E)2 (T)`. — Chris, Apr 29 '22 at 21:59
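
The last comment can be illustrated with a short sketch. It assumes the fragments quoted there sit inside a `TJ` array (the surrounding `[...] TJ` is an assumption, as is the helper name `visible_text`). A plain substring search fails because the word is split into kerned pieces, while joining the pieces recovers the visible text. This only helps to detect a match; writing a replacement back would still mean rebuilding the array, and hex strings or font-specific encodings are not handled here.

```python
import re

# Matches one literal string "(...)" inside a TJ array, allowing
# backslash-escaped characters such as \( and \). Unescaped nested
# parentheses are not handled in this sketch.
_LITERAL = re.compile(r"\((?:\\.|[^\\()])*\)")


def visible_text(tj_line):
    """Join the string fragments of a TJ array, dropping the kerning numbers."""
    parts = [m.group(0)[1:-1] for m in _LITERAL.finditer(tj_line)]
    return "".join(parts)


line = "[(A)-5.5 (BC OF)-5.5 ( ALPHA)7.4 (B)-5.5 (E)2 (T)] TJ"

print("ALPHABET" in line)   # the word never appears contiguously in the raw line
print(visible_text(line))   # the fragments joined back together
```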

score 9 · Answer 2 · answered Oct 04 '21 at 17:01

If @Dmytro's solution does not alter the final PDF

Dmytro's updated code example, which handles DecodedStreamObject and EncodedStreamObject (the objects that actually contain the data stream with the text to update), ran fine for me, but with a file different from the example it was not able to alter the PDF text content.

According to EDIT 3, from How to replace text in a PDF using Python?:

By inserting page[NameObject("/Contents")] = contents.decodedSelf before writer.addPage(page), we force pyPDF2 to update content of the page object.

This way I was able to overcome this problem and replace text from pdf file.

Final code should look like this:

import os
import argparse
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import DecodedStreamObject, EncodedStreamObject, NameObject


def replace_text(content, replacements = dict()):
    lines = content.splitlines()

    result = ""
    in_text = False

    for line in lines:
        if line == "BT":
            in_text = True

        elif line == "ET":
            in_text = False

        elif in_text:
            cmd = line[-2:]
            if cmd.lower() == 'tj':
                replaced_line = line
                for k, v in replacements.items():
                    replaced_line = replaced_line.replace(k, v)
                result += replaced_line + "\n"
            else:
                result += line + "\n"
            continue

        result += line + "\n"

    return result


def process_data(object, replacements):
    data = object.getData()
    decoded_data = data.decode('utf-8')

    replaced_data = replace_text(decoded_data, replacements)

    encoded_data = replaced_data.encode('utf-8')
    if object.decodedSelf is not None:
        object.decodedSelf.setData(encoded_data)
    else:
        object.setData(encoded_data)


if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--input", required=True, help="path to PDF document")
    args = vars(ap.parse_args())

    in_file = args["input"]
    filename_base = in_file.replace(os.path.splitext(in_file)[1], "")

    # Provide replacements list that you need here
    replacements = { 'PDF': 'DOC'}

    pdf = PdfFileReader(in_file)
    writer = PdfFileWriter()

    for page_number in range(0, pdf.getNumPages()):

        page = pdf.getPage(page_number)
        contents = page.getContents()

        if isinstance(contents, DecodedStreamObject) or isinstance(contents, EncodedStreamObject):
            process_data(contents, replacements)
        elif len(contents) > 0:
            for obj in contents:
                if isinstance(obj, DecodedStreamObject) or isinstance(obj, EncodedStreamObject):
                    streamObj = obj.getObject()
                    process_data(streamObj, replacements)

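        # Force content replacement. decodedSelf is missing on an array of streams
        # and is None on an already-decoded stream, so skip the assignment then.
        decoded = getattr(contents, "decodedSelf", None)
        if decoded is not None:
            page[NameObject("/Contents")] = decoded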
        writer.addPage(page)

    with open(filename_base + ".result.pdf", 'wb') as out_file:
        writer.write(out_file)

Important: from PyPDF2.generic import NameObject

I have this problem, but it seems that `data.decode('utf-8')` does not decode to a text format? – alias51, Oct 12 '21 at 12:59
It is possible that your PDF does not use utf-8 encoding. You might want to test whether `data.decode("ascii")` works for you. By the way, if you live in Latin America (as I do) you may want to try `data.decode("iso-8859-1")`. If this doesn't help, you can try to brute-force the decoding by passing `data.decode("utf-8", "ignore")` – Vladimir Simoes da Luz Junior, Oct 13 '21 at 13:37
I ran a `for` loop over every known standard and it didn't work. I can only assume that Acrobat encodes PDFs differently when `Save As` from Word is used? — alias51, Oct 13 '21 at 14:42
@alias51, have you tried to `print(object.getData())` inside process_data()? If that does not give you the text content of the pdf, it is possible that your file has been password encrypted by Acrobat. You can get some reference on password decrypting here: https://github.com/mstamy2/PyPDF2/issues/378 ; https://github.com/atlanhq/camelot/issues/325 ; https://github.com/mstamy2/PyPDF2/issues/378#issuecomment-689585779 – Vladimir Simoes da Luz Junior, Oct 27 '21 at 02:32
I tried to run this code, but I got an error - AttributeError: 'ArrayObject' object has no attribute 'decodeSelf'. Do you have any idea to solve it? — moep0, May 24 '22 at 06:52
I got an error `Exception has occurred: ValueError value must be PdfObject` that occurs when running `page[NameObject("/Contents")] = contents.decodedSelf`. Any idea? — MrT77, Sep 05 '22 at 14:27
Upon inspection, i realised that `contents.decodedSelf` is `None`... What am I doing wrong? — MrT77, Sep 05 '22 at 14:49
The code is not running, the file is never created. I'm getting no errors even. — Arnav, Sep 17 '22 at 11:00
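
The decoding suggestions in the comments above can be sketched as a small fallback helper. The function name `decode_stream` and the sample bytes are assumptions for illustration. Note that ISO-8859-1 maps every byte value to a character, so it never fails and round-trips exactly; that makes it a safe last resort for getting the bytes in and out, but it does not make the text searchable if the font uses a custom encoding, which may be what alias51 ran into.

```python
def decode_stream(data, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Hypothetical sample: the byte 0xE9 followed by ")" is not valid UTF-8
raw = b"BT\n(Jos\xe9) Tj\nET\n"

text, enc = decode_stream(raw)
print(enc)

# Re-encode with the same encoding so untouched bytes are written back unchanged
assert text.encode(enc) == raw
```

Whichever encoding succeeds should also be used for the `encode` step in process_data, otherwise bytes outside ASCII can change on the way back.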

score 5 · Answer 3 · answered Sep 22 '21 at 00:23

Decompress the pdf to make parsing easier (solves many of the issues in the previous answer). I use pdftk. (If this step fails, one hack to pre-process the pdf is to open the pdf in OSX Preview, print it, and then choose save as pdf from the print menu. Then retry the command below.)

pdftk original.pdf output uncompressed.pdf uncompress

Parse and replace using PyPDF2.

from PyPDF2 import PdfFileReader, PdfFileWriter

replacements = [
    ("old string", "new string")
]

pdf = PdfFileReader(open("uncompressed.pdf", "rb"))
writer = PdfFileWriter() 

for page in pdf.pages:
    contents = page.getContents().getData()
    for (a,b) in replacements:
        contents = contents.replace(a.encode('utf-8'), b.encode('utf-8'))
    page.getContents().setData(contents)
    writer.addPage(page)
    
with open("modified.pdf", "wb") as f:
     writer.write(f)

[Optional] Re-compress the pdf.

pdftk modified.pdf output recompressed.pdf compress
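
Since the original question is about a Python web app, the two pdftk steps above could be driven from Python as well. This is a minimal sketch, assuming pdftk is installed and on PATH; the helper names `pdftk_command` and `run_pdftk` are made up for illustration, and only the command lines already shown in this answer are used.

```python
import shutil
import subprocess


def pdftk_command(src, dst, mode):
    """Build the pdftk argument list; mode is 'uncompress' or 'compress'."""
    if mode not in ("uncompress", "compress"):
        raise ValueError("mode must be 'uncompress' or 'compress'")
    return ["pdftk", src, "output", dst, mode]


def run_pdftk(src, dst, mode):
    """Run pdftk, raising if it is missing or exits with an error."""
    if shutil.which("pdftk") is None:
        raise RuntimeError("pdftk was not found on PATH")
    subprocess.run(pdftk_command(src, dst, mode), check=True)


# Shows the command that would be executed, without running it
print(pdftk_command("original.pdf", "uncompressed.pdf", "uncompress"))
```

Calling `run_pdftk("original.pdf", "uncompressed.pdf", "uncompress")` before the PyPDF2 step and `run_pdftk("modified.pdf", "recompressed.pdf", "compress")` after it would mirror the manual steps.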

D.Deriso

Results in `PyPDF2.utils.PdfReadError: Creating EncodedStreamObject is not currently supported` – alias51 Oct 12 '21 at 11:02
Same. Any idea how to fix this? – Arnav Sep 17 '22 at 10:55
Not sure why that error is occurring. I just double checked and this recipe still works on my end. Perhaps it's an issue that should be reported to the PyPDF2 github repo. – D.Deriso Sep 27 '22 at 01:22
PyPDF2 seems to have been merged with or renamed to pypdf again. The camel-case methods are now considered deprecated. However, your code was helpful for my alternative solution [here](https://stackoverflow.com/questions/31703037/how-can-i-replace-text-in-a-pdf-using-python/#75822833). – Hermann Mar 23 '23 at 12:13

score 2 · Answer 4 · answered Sep 13 '22 at 10:27

Here is a solution using the MS Word source file.

Editing the PDF itself turned out to be too complicated for me because of the encoding errors, so I went with the MS Word to PDF route:

1. Prepare an MS Word template with {{input_fields}}
2. Fill in the template with data
3. Convert the filled-in MS Word file to PDF

The docxtpl package (which provides DocxTemplate) uses Jinja-like syntax: {{variable_name}}

My solution uses an intermediate temp file. I tried to get rid of this step by using BytesIO/StringIO to keep it in memory only, but I haven't made that work yet.

Here is a simple working solution for the required task:

import os
import comtypes.client
from pathlib import Path
from docxtpl import DocxTemplate
import random


# CFG
in_file_path = "files/template.docx"
temp_file_path = "files/"+str(random.randint(0,50))+".docx"
out_file_path = "files/output.pdf"


# Fill in text
data_to_fill = {'Field_name' : "John Tester",
                  'Field_ocupation' : "Test tester",
                  'Field_address' : "Test Address 123",
                  }

template = DocxTemplate(Path(in_file_path))
template.render(data_to_fill)
template.save(Path(temp_file_path))

# Convert to PDF
wdFormatPDF = 17

in_file = os.path.abspath(Path(temp_file_path))
out_file = os.path.abspath(Path(out_file_path))

word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()

# Get rid of the temp file
os.remove(Path(temp_file_path))
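
One fragile spot in the code above is the temp file name, which is picked from only 51 random numbers and could collide when several letters are generated at once. A minimal sketch of an alternative uses the standard library's `tempfile` module to get a unique path and clean it up afterwards. The helper name `temp_docx_path` is an assumption for illustration; the docxtpl and Word calls are left as a comment because they are unchanged.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_docx_path(directory=None):
    """Yield a unique .docx path and delete the file afterwards."""
    fd, path = tempfile.mkstemp(suffix=".docx", dir=directory)
    os.close(fd)  # only the path is needed; other code reopens the file
    try:
        yield path
    finally:
        if os.path.exists(path):
            os.remove(path)


with temp_docx_path() as tmp:
    print(tmp)
    # template.save(tmp) and the Word conversion from the code above would go here
```

The path returned by `mkstemp` is absolute, so it can be passed to `word.Documents.Open` directly, and the `finally` clause removes the file even if the conversion raises.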
