How do I turn content into a stream in PDFs?

Question

I need to make a PDF editor using PyPDF2. Sadly, there are only around 4-6 videos about this module, and they all show how to edit and manipulate the general screen, not the PDF. So I used the documentation on its own to learn how to use it. I was able to do most things with the documentation alone, but once I reached the point of editing text, I couldn't find any way to do so.

Here is my current attempt at editing a PDF's content:

import PyPDF2

pdf_file = open('pdf name goes here', 'rb')
pdf_reader = PyPDF2.PdfReader(pdf_file)
# Get the page that you want to modify
page = pdf_reader.pages[0]

content_object = page["/Contents"].get_object()
content = content_object.get_data()

modified_content = content + b"\n(new text)"

new_content_object = # i don't know how to create the new content object ):

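# dictionary keys must be PDF name objects, not plain Python strings
page[PyPDF2.generic.NameObject("/Contents")] = new_content_object

# PdfWriter / add_page match the PdfReader naming used above
pdf_writer = PyPDF2.PdfWriter()
pdf_writer.add_page(page)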
with open('output.pdf', 'wb') as pdf_output:
    pdf_writer.write(pdf_output)

As you can see, my issue is that I don't know how to create a new content object. However, if anybody could suggest a Python module for editing text, I would be very happy. Thanks!

I don't think you can just put new text into an object like that in the first place... — AKX, May 04 '23 at 12:55
Anyway, PyPDF2 seems to be the wrong tool anyway: calling `set_data` says "Creating EncodedStreamObject is not currently supported"... — AKX, May 04 '23 at 13:04
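
To illustrate the first comment: a bare `(new text)` appended to the content is only a string operand with no operator, so a viewer has nothing to draw. Below is a minimal, library-free sketch of what a text-showing content stream looks like and how raw bytes are wrapped as a PDF stream object. The helper names (`escape_pdf_string`, `text_operators`, `stream_object`) are made up for this illustration, the font name `/F1` is an assumption (it has to exist in the page's `/Resources`), and the coordinates are assumed to be in points from the bottom-left corner. It is not what PyPDF2 does internally.

```python
import zlib


def escape_pdf_string(text: str) -> bytes:
    """Escape backslashes and parentheses for a PDF literal string."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
    )
    # Assumption: Latin-1 text only, to keep the sketch simple.
    return escaped.encode("latin-1")


def text_operators(text: str, x: float, y: float,
                   font: str = "/F1", size: float = 12) -> bytes:
    """Return content-stream operators that show `text` at (x, y)."""
    return (
        b"BT\n"                                    # begin text object
        + f"{font} {size:g} Tf\n".encode("ascii")  # select font and size
        + f"{x:g} {y:g} Td\n".encode("ascii")      # move to the start position
        + b"(" + escape_pdf_string(text) + b") Tj\n"  # show the string
        + b"ET\n"                                  # end text object
    )


def stream_object(data: bytes, compress: bool = False) -> bytes:
    """Wrap bytes as a PDF stream: dictionary, `stream`, data, `endstream`."""
    if compress:
        data = zlib.compress(data)
        header = f"<< /Length {len(data)} /Filter /FlateDecode >>"
    else:
        header = f"<< /Length {len(data)} >>"
    # /Length counts only the bytes between the two keywords.
    return header.encode("ascii") + b"\nstream\n" + data + b"\nendstream"


ops = text_operators("new text (draft)", x=72, y=720)
stream = stream_object(ops)
```

The bytes in `ops` are what would need to be appended to the existing page content, and a PDF library then still has to register the resulting stream as an indirect object and update the cross-reference table, which is why doing this by hand is rarely worth it.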

score 0 · Answer 1 · answered May 04 '23 at 13:11

You can use the Canvas object from reportlab to add the text and then merge the two PDFs afterwards. Here it is explained how to do it. Or here they use fpdf to replace the text in your file.

ramsluk

score 0 · Answer 2 · answered May 04 '23 at 15:00

Disclaimer: I am the author of borb, the library used in this answer.

Many PDF libraries out there simply don't make it easy to add content to a PDF. PDF is not an easy format, and most libraries simply pass that difficulty on to the user.

For example, by:

forcing you to calculate specific coordinates for content
having you manipulate content streams directly
not automatically breaking text
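
To make the first and last points concrete, here is a rough sketch of the bookkeeping a low-level library leaves to you: breaking the text yourself and working out a baseline coordinate for every line. The function name `layout_lines` and its parameters are hypothetical, and wrapping by character count is a deliberate simplification; real layout needs the glyph widths of the font in use.

```python
import textwrap


def layout_lines(text, left=72, top=720, leading=14, max_chars=40):
    """Break `text` into lines and assign each an (x, y) baseline in points."""
    # Assumption: every character is equally wide, so a character limit
    # stands in for a proper width measurement.
    lines = textwrap.wrap(text, width=max_chars)
    # PDF y coordinates grow upwards, so each new line moves down by `leading`.
    return [(left, top - i * leading, line) for i, line in enumerate(lines)]


sentence = "This content is going to be added neatly beneath the first paragraph."
placed = layout_lines(sentence)
for x, y, line in placed:
    print(x, y, line)
```

Each tuple would then still have to be turned into text operators in a content stream, which is the second point in the list.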

If you can change the tool you're working with, try using borb.

pip install borb

and then you can do something like:

from borb.pdf import Document
from borb.pdf import Page
from borb.pdf import SingleColumnLayout
from borb.pdf import Paragraph
from borb.pdf import PDF

# create an empty Document
doc = Document()

# add an empty Page
page = Page()
doc.add_page(page)

# use a PageLayout to be able to automatically add
# content whilst taking into account margin, previous content
# on the page, etc
layout = SingleColumnLayout(page)

# add a Paragraph
layout.add(Paragraph("Hello there!"))

# add a second Paragraph
layout.add(Paragraph("This content is going to be added neatly beneath the first paragraph."))

# store the PDF
with open("output.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, doc)

You can find more documentation in the examples GitHub repository.
