I am trying to run the following code to replace text inside a PDF file:

import os
import re
import PyPDF2
from io import StringIO

# Define a function to replace text in a PDF file
def replace_text_in_pdf(input_pdf_path, output_pdf_path, search_text, replace_text):
    # Open the input PDF file in read-binary mode
    with open(input_pdf_path, 'rb') as input_file:
        # Create a PDF reader object
        pdf_reader = PyPDF2.PdfReader(input_file)
        
        # Create a PDF writer object
        pdf_writer = PyPDF2.PdfWriter()
        
        # Iterate through each page of the PDF
        for page_num in range(len(pdf_reader.pages)):
            # Get the page object
            page = pdf_reader.pages[page_num]
            
            # Get the text content of the page
            text = page.extract_text()
            
            # Replace the search text with the replace text
            new_text = re.sub(search_text, replace_text, text)
            
            # Create a new page with the replaced text
            new_page = PyPDF2.PageObject.create_blank_page(None, page.mediabox.width, page.mediabox.height)
            new_page.merge_page(page)  # Copy the original page content to the new page
            new_page.add_transformation(PyPDF2.Transformation().translate(0, 0).scale(1, 1))  # Reset the transformation matrix
            
            # Begin the text object
            new_page._text = PyPDF2.ContentStream(new_page.pdf)
            new_page._text.beginText()
            
            # Set the font and font size
            new_page._text.setFont("Helvetica", 12)
            
            # Draw the new text on the page
            x, y = 100, 100  # Replace with the desired position of the new text
            new_page._text.setFontSize(12)
            new_page._text.textLine(x, y, new_text)
            
            # End the text object
            new_page._text.endText()
            
            # Add the new page to the PDF writer object
            pdf_writer.addPage(new_page)
        
        # Save the new PDF file
        with open(output_pdf_path, 'wb') as output_file:
            pdf_writer.write(output_file)

# Call the function to replace text in a PDF file
input_pdf_path = r'D:\file1.pdf'  # Replace with your input PDF file path
output_pdf_path = r'D:\file1_replaced.pdf'  # Replace with your output PDF file path
search_text = '<FirstName>'  # Replace with the text you want to replace
replace_text = 'John'  # Replace with the text you want to replace it with
replace_text_in_pdf(input_pdf_path, output_pdf_path, search_text, replace_text)

However, line: new_page._text = PyPDF2.ContentStream(new_page.pdf) is giving me the following error: module 'PyPDF2' has no attribute 'ContentStream'.

Can someone help me fix it?

gtomer
  • You should try `pdfrw`; `ContentStream` is not in the `PyPDF2` module. – Memristor May 11 '23 at 11:31
  • `PyPDF2` has a `ContentStream`. It can be found as `PyPDF2.generic.ContentStream`. If possible, you should also switch to `pypdf`, the name under which `PyPDF2` continues to receive updates: [PyPDF2 deprecation notice](https://pypi.org/project/PyPDF2/) – Clasherkasten May 11 '23 at 12:13
  • @Clasherkasten- you are right. However, switching to pypdf gave the same error: AttributeError: module 'pypdf' has no attribute 'ContentStream' – gtomer May 11 '23 at 18:57
  • And when trying to use '.generic' I get this error: TypeError: ContentStream.__init__() missing 1 required positional argument: 'pdf' – gtomer May 11 '23 at 19:23
  • What `PyPDF2` version are you using? – Memristor May 16 '23 at 09:39
  • The most recent doc has many changes (I'm trying it with 3.0.1): https://pypdf2.readthedocs.io/_/downloads/en/latest/pdf/ – Memristor May 16 '23 at 10:19
  • PyPDF version 3.0.1. – gtomer May 16 '23 at 12:06
  • Also, when using generic.ContentStream and giving it the additional positional argument, this error is thrown: AttributeError: 'ContentStream' object has no attribute 'beginText'. Which makes me wonder, how you came up with the code, as it is now. – slapslash May 17 '23 at 13:37
  • @gtomer I got this same code by asking ChatGPT "how to replace a word in a pdf using python" – celsowm Jul 21 '23 at 03:39
  • @celsowm - and? – gtomer Jul 21 '23 at 15:39

2 Answers

1

PyPDF2 has no top-level ContentStream attribute, so you can't use it directly to create a new text object.

ContentStream is an internal class rather than part of the top-level public API, which is why PyPDF2.ContentStream does not resolve. Instead, you need to use PdfFileWriter's _addObject method to add a new ContentStream object, and then use PdfFileWriter's _updateObject method to update the contents of the page.

You can refer to this Stack Overflow answer, which includes sample code for adding a watermark to a PDF; you can adapt it to replace text instead.

0

You get an AttributeError here for a simple reason: the library you are using is not designed to modify and write PDF files like you're doing.

pypdf is a free and open source pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. pypdf can retrieve text and metadata from PDFs as well.

This is true for pypdf, PyPDF2 and also for PyPDF3.

So, the main focus of this library is not modifying text inside a PDF. It may still be possible by following the examples already mentioned here; you can try them out and see whether they help. I see some obstacles if you do not have real text, though.

It is entirely unclear how you arrived at this code snippet. The ContentStream object simply does not exist in the form the code expects (at least not with a begin_text() attribute). Presumably it is a piece of code from another library, or possibly from this fork that provides ContentStream under pdf, i.e. PyPDF4.pdf.ContentStream. In any case, as far as I can see, none of the PyPDF variants provide it together with begin_text().
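If you want to check for yourself which of these names resolve in your installed version, here is a minimal sketch. The helper has_attr_path is a hypothetical name of my own, not part of any PDF library, and the results for pypdf depend entirely on what you have installed, so no output is claimed for those lines.

```python
import importlib


def has_attr_path(module_name, dotted_attr):
    """Return True/False if the dotted attribute resolves on the module,
    or None if the module itself cannot be imported."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return None
    # Walk the dotted path one attribute at a time
    for part in dotted_attr.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


# Standard-library demonstration, so the snippet runs without any PDF package
print(has_attr_path("os", "path.join"))      # an attribute path that exists
print(has_attr_path("os", "ContentStream"))  # an attribute that does not exist

# The names discussed in this thread; results depend on the installed package
for path in ("ContentStream", "generic.ContentStream", "generic.ContentStream.beginText"):
    print("pypdf", path, has_attr_path("pypdf", path))
```

A result of False for a path is exactly the situation that raises "module ... has no attribute ..." when the code tries to use it.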

To finally fix your code, you have several options. Try the solution already mentioned in this SO answer, like this

for page in pdf_reader.pages:
    data = page.get_contents().get_data()
    # bytes.replace() returns a new object, so the result must be reassigned
    data = data.replace(search_text.encode("utf-8"), replace_text.encode("utf-8"))
    page.get_contents().set_data(data)
    pdf_writer.add_page(page)

or try to achieve your goal with something other than pypdf(2) alone. Here are some other possibilities you can try out:

Just as a side note: PyPDF2 is going back to its roots, i.e. pypdf has been maintained again since version 3.1.0 (see notes). So hopefully there will be less confusion in the future about the different versions and forks.
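One more remark on the content-stream replacement loop above: a byte-level replace only works when the search text appears as one contiguous run of bytes in the stream. The following is a minimal sketch using only the standard library. The helper name replace_in_stream and both sample streams are hand-written assumptions for illustration, not data extracted from a real PDF.

```python
from typing import Tuple


def replace_in_stream(data: bytes, search: str, replacement: str) -> Tuple[bytes, int]:
    """Replace search with replacement in raw stream bytes.

    Returns the new bytes and the number of matches found, so that a
    silent no-op (zero matches) can be detected by the caller.
    """
    needle = search.encode("utf-8")
    hits = data.count(needle)
    # bytes objects are immutable: replace() returns a new object
    return data.replace(needle, replacement.encode("utf-8")), hits


# Hypothetical stream where the placeholder is stored as one literal string
simple = b"BT /F1 12 Tf 100 700 Td (Hello <FirstName>) Tj ET"
new_simple, simple_hits = replace_in_stream(simple, "<FirstName>", "John")

# Hypothetical stream where the same text is split by a kerning adjustment
# inside a TJ array, so the placeholder never appears contiguously
split = b"BT /F1 12 Tf 100 700 Td [(<First) -20 (Name>)] TJ ET"
new_split, split_hits = replace_in_stream(split, "<FirstName>", "John")

print(new_simple, simple_hits)
print(new_split, split_hits)
```

Checking the match count before writing the output file is a cheap way to notice that nothing was replaced, which is what happens when the text is split, stored with a custom font encoding, or not real text at all.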

colidyre