I have a PDF file and I want to replace some text within the PDF file and generate new PDF file. How can I do that in Python?
-
1Using pymupdf: https://stackoverflow.com/a/75387883/562769 – Martin Thoma Aug 26 '23 at 10:27
5 Answers
You can try Aspose.PDF Cloud SDK for Python, Aspose.PDF Cloud is a REST API PDF Processing solution. It is paid API and its free package plan provides 50 credits per month.
I'm developer evangelist at Aspose.
import os
import asposepdfcloud
from asposepdfcloud.apis.pdf_api import PdfApi
# Get App key and App SID from https://cloud.aspose.com
pdf_api_client = asposepdfcloud.api_client.ApiClient(
app_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
app_sid='xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxx')
pdf_api = PdfApi(pdf_api_client)
filename = '02_pages.pdf'
remote_name = '02_pages.pdf'
copied_file= '02_pages_new.pdf'
#upload PDF file to storage
pdf_api.upload_file(remote_name,filename)
#upload PDF file to storage
pdf_api.copy_file(remote_name,copied_file)
#Replace Text
text_replace = asposepdfcloud.models.TextReplace(old_value='origami',new_value='polygami',regex='true')
text_replace_list = asposepdfcloud.models.TextReplaceListRequest(text_replaces=[text_replace])
response = pdf_api.post_document_text_replace(copied_file, text_replace_list)
print(response)

- 940
- 5
- 9
disclaimer: I am the author of borb
, the library used in this answer
This example comes straight out of the examples repository, which you can find here.
It starts by creating a PDF document:
#!chapter_007/src/snippet_012.py
from borb.pdf import Document
from borb.pdf import Page
from borb.pdf import PageLayout, SingleColumnLayout
from borb.pdf import Table, FixedColumnWidthTable
from borb.pdf import Paragraph
from borb.pdf import PDF
from decimal import Decimal
def main():
# create empty Document
doc: Document = Document()
# add new Page
pge: Page = Page()
doc.add_page(pge)
# set PageLayout
lay: PageLayout = SingleColumnLayout(pge)
# add Table
tab: Table = FixedColumnWidthTable(number_of_columns=2, number_of_rows=3)
tab.add(Paragraph("Name:", font="Helvetica-Bold"))
tab.add(Paragraph("Schellekens"))
tab.add(Paragraph("Firstname:", font="Helvetica-Bold"))
tab.add(Paragraph("Jots"))
tab.add(Paragraph("Title:", font="Helvetica-Bold"))
tab.add(Paragraph("CEO borb"))
tab.set_padding_on_all_cells(Decimal(5), Decimal(5), Decimal(5), Decimal(5))
lay.add(tab)
# store
with open("output.pdf", 'wb') as pdf_file_handle:
PDF.dumps(pdf_file_handle, doc)
if __name__ == "__main__":
main()
This document should look like this:
It contains a misspelling of my name, which we will correct using the following code:
#!chapter_007/src/snippet_013.py
from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import SimpleFindReplace
import typing
def main():
# attempt to read a PDF
doc: typing.Optional[Document] = None
with open("output.pdf", "rb") as pdf_file_handle:
doc = PDF.loads(pdf_file_handle)
# check whether we actually read a PDF
assert doc is not None
# find/replace
doc = SimpleFindReplace.sub("Jots", "Joris", doc)
# store
with open("output2.pdf", "wb") as pdf_file_handle:
PDF.dumps(pdf_file_handle, doc)
if __name__ == "__main__":
main()
This gives us the following output:
Keep in mind that SimpleFindReplace
does not handle complex re-flow. It only handles cases where you want to replace some text by some other text, without any influence on surrounding text.
You can obtain borb
using pip
or download its source code here.

- 8,483
- 2
- 23
- 54
Have a look in THIS thread for one of the many ways to read text from a PDF. Then you'll need to create a new pdf, as they will, as far as I know, not retrieve any formatting for you.

- 818
- 6
- 19
The documents I am dealing with use subsetted fonts and character maps. I came up with this solution. It translates the search and replacement strings "backwards" and proceeds to modify the actual object stream data:
#!/usr/bin/env python3
import argparse
import pypdf
from typing import Any, Callable, Dict, Tuple, Union, cast
from pypdf.generic import DictionaryObject, NameObject, RectangleObject
from pypdf.constants import PageAttributes as PG
from pypdf._cmap import build_char_map
# from https://github.com/py-pdf/pypdf/blob/27d0e99/pypdf/_page.py#L1546
def get_char_maps(obj: Any, space_width: float = 200.0):
cmaps: Dict[
str,
Tuple[
str, float, Union[str, Dict[int, str]], Dict[str, str], DictionaryObject
],
] = {}
objr = obj
while NameObject(PG.RESOURCES) not in objr:
# /Resources can be inherited sometimes so we look to parents
objr = objr["/Parent"].get_object()
resources_dict = cast(DictionaryObject, objr[PG.RESOURCES])
if "/Font" in resources_dict:
for f in cast(DictionaryObject, resources_dict["/Font"]):
cmaps[f] = build_char_map(f, space_width, obj)
return {cmap[4]["/BaseFont"]:cmap[3] for cmap in cmaps.values()}
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Replace text in a PDF file.')
parser.add_argument('--input')
parser.add_argument('--output')
parser.add_argument('--papersize', default="A4")
parser.add_argument('--font', required=False)
parser.add_argument('--separator', default="→")
parser.add_argument('replacements', nargs='*')
args = parser.parse_args()
replacements = [pair.split(args.separator) for pair in args.replacements]
total_replacements = 0
reader = pypdf.PdfReader(args.input)
writer = pypdf.PdfWriter()
for page_index, page in enumerate(reader.pages):
print(f"Processing page {page_index+1}…")
cmaps = get_char_maps(page)
for fontname in cmaps.keys():
if (not args.font):
print(fontname)
elif (fontname.endswith(args.font)):
args.font = fontname
if (not args.font):
continue
charmap = cmaps[args.font]
reverse_charmap = {v:k for k,v in charmap.items()}
def full_to_subsetted(full):
subsetted = ''.join([reverse_charmap[c] for c in full])
subsetted = subsetted.replace(r'(',r'\(').replace(r')',r'\)') # TODO: which other characters must be escaped?
return subsetted.encode('ascii') # TODO: which encoding is actually used?
subsetted_replacements = [(full_to_subsetted(f),full_to_subsetted(t)) for f,t in replacements]
page_replacements = 0
# based on https://stackoverflow.com/questions/41769120/search-and-replace-for-text-within-a-pdf-in-python#69276885
for content in page.get_contents():
obj = content.get_object()
data = obj.get_data()
for f,t in subsetted_replacements:
while (f in data):
data = data.replace(f, t, 1)
page_replacements += 1
obj.set_data(data)
if (page_replacements > 0):
total_replacements += page_replacements
print(f"Replaced {page_replacements} occurrences on this page.")
papersize = getattr(pypdf.PaperSize, args.papersize)
page.mediabox = RectangleObject((0, 0, papersize.width, papersize.height))
writer.add_page(page)
if (args.output):
writer.write(args.output)
print(f"Replaced {total_replacements} occurrences in document.")
This only works if the text you want to replace is typeset by a single Tj
command using a specific syntax and without any adjustments to the character spacing. The streams must not be compressed. I use qpdf --qdf in.pdf uncompressed.pdf
for decompression. D.Deriso's answer was useful. Maybe my answer here helps someone else, too.

- 604
- 7
- 23
The CAM::PDF Perl Library can output text that's not too hard to parse (it seems to fairly randomly split lines of text). I couldn't be bothered to learn too much Perl, so I wrote these really basic Perl command line scripts, one that reads a single page pdf to a text file perl read.pl pdfIn.pdf textOut.txt
and one that writes the text (that you can modify in the meantime) to a pdf perl write.pl pdfIn.pdf textIn.txt pdfOut.pdf
.
#!/usr/bin/perl
use Module::Load;
load "CAM::PDF";
$pdfIn = $ARGV[0];
$textOut = $ARGV[1];
$pdf = CAM::PDF->new($pdfIn);
$page = $pdf->getPageContent(1);
open(my $fh, '>', $textOut);
print $fh $page;
close $fh;
exit;
and
#!/usr/bin/perl
use Module::Load;
load "CAM::PDF";
$pdfIn = $ARGV[0];
$textIn = $ARGV[1];
$pdfOut = $ARGV[2];
$pdf = CAM::PDF->new($pdfIn);
my $page;
open(my $fh, '<', $textIn) or die "cannot open file $filename";
{
local $/;
$page = <$fh>;
}
close($fh);
$pdf->setPageContent(1, $page);
$pdf->cleanoutput($pdfOut);
exit;
You can call these with python either side of doing some regex etc stuff on the outputted text file.
If you're completely new to Perl (like I was), you need to make sure that Perl and CPAN are installed, then run sudo cpan
, then in the prompt install "CAM::PDF";
, this will install the required modules.
Also, I realise that I should probably be using stdout etc, but I was in a hurry :-)
Also also, any ideas what the format CAM-PDF outputs is? is there any doc for it?

- 1
- 1

- 365
- 3
- 8
-
1There is some more useful documentation here http://search.cpan.org/dist/CAM-PDF/lib/CAM/PDF.pm if I get round to it, I might write some kind of Python wrapper – leontrolski Mar 03 '16 at 16:49