Questions tagged [pypdf]

pypdf is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. It can retrieve text and metadata from PDFs and merge entire files together.

A Pure-Python library built as a PDF toolkit. It is capable of:

extracting document information (title, author, ...),
splitting documents page by page,
merging documents page by page,
cropping pages,
merging multiple pages into a single page,
encrypting and decrypting PDF files.

Because it is pure Python, it should run on any Python platform without depending on external libraries. It can also work entirely on in-memory stream objects (StringIO in Python 2, io.BytesIO in Python 3) rather than files on disk, allowing PDF manipulation in memory. It is therefore a useful tool for websites that manage or manipulate PDFs.

pypdf was inactive from 2010 to 2022. Active maintenance resumed in December 2022.

Relationship to PyPDF2

PyPDF2 was a fork of pyPdf.

PyPDF2 received many updates in 2022, but it has since been deprecated in favor of pypdf.

pypdf==3.1.0 is essentially the same as PyPDF2==3.0.0; only the package name was changed to pypdf.

See: https://pypdf.readthedocs.io/en/latest/meta/history.html

1451 questions

270

votes

15 answers

Merge PDF files

Is it possible, using Python, to merge separate PDF files? Assuming so, I need to extend this a little further. I am hoping to loop through folders in a directory and repeat this procedure. And I may be pushing my luck, but is it possible to…

asked Aug 09 '10 at 22:23

Btibert3

115

votes

24 answers

Extract images from PDF without resampling, in python?

How might one extract all images from a pdf document, at native resolution and format? (Meaning extract tiff as tiff, jpeg as jpeg, etc. and without resampling). Layout is unimportant, I don't care where the source image is located on the page.

python image pdf extract pypdf

asked Apr 22 '10 at 19:26

matt wilkie

votes

6 answers

How can I remove a URL channel from Anaconda?

Recently I needed to install PyPDF2 for one of my programs using Anaconda. Unfortunately, I failed, and the URLs that were added to the Anaconda environment now block updates of all the Conda libraries. Every time I try to update Anaconda it gives…

python anaconda channel pypdf

asked Sep 18 '16 at 13:47

Mohammad ElNesr

votes

12 answers

How to check if PDF is scanned image or contains text

I have a large number of files; some are scanned images saved as PDF and some are full or partial text PDFs. Is there a way to check these files to ensure that we are only processing files which are scanned images and not those that are full/partial…

python python-3.x pypdf pdfminer pdf-extraction

asked Apr 16 '19 at 08:54

Jinu Joseph

votes

5 answers

pypdf Merging multiple pdf files into one pdf

If I have 1000+ pdf files that need to be merged into one pdf: from PyPDF2 import PdfReader, PdfWriter writer = PdfWriter() for i in range(1000): filepath = f"my/pdfs/{i}.pdf" reader = PdfReader(open(filepath, "rb")) for page in…

python pypdf

asked Jun 14 '13 at 09:07

daydaysay

votes

8 answers

Unable to use pypdf module

I have installed the pyPdf module successfully using the command pip install pydf but when I use the module using the import command I get the following error: C:\Anaconda3\lib\site-packages\pyPdf\__init__.py in () 1 from pdf import…

python-3.x pypdf

asked Feb 09 '17 at 07:19

Nitin Vijay

votes

3 answers

How to read line by line in pdf file using PyPdf?

I have some code to read from a pdf file. Is there a way to read line by line from the pdf file (not pages) using Pypdf, Python 2.6, on Windows? Here is the code for reading the pdf pages: import pyPdf def getPDFContent(path): content = "" …

python pdf pypdf

asked Mar 20 '10 at 04:39

Rami Jarrar

votes

7 answers

Retrieve Custom page labels from document with pyPdf

At the moment I'm looking into doing some PDF merging with pyPdf, but sometimes the inputs are not in the right order, so I'm looking into scraping each page for its page number to determine the order it should go in (e.g. if someone split up a book…

python pypdf

asked Sep 10 '12 at 23:59

SquidneyPoitier

votes

5 answers

Xref table not zero-indexed. ID numbers for objects will be corrected. won't continue

I am trying to open a pdf to get the number of pages. I am using PyPDF2. Here is my code: def pdfPageReader(file_name): try: reader = PyPDF2.PdfReader(file_name, strict=True) number_of_pages = len(reader.pages) …

python-3.x pypdf

asked Apr 20 '18 at 10:02

JBin

votes

7 answers

Cropping pages of a .pdf file

I was wondering if anyone had any experience in working programmatically with .pdf files. I have a .pdf file and I need to crop every page down to a certain size. After a quick Google search I found the pyPdf library for python but my experiments…

python pdf pypdf

asked Jan 19 '09 at 10:43

johannth

votes

10 answers

How to extract text from pdf in Python 3.7

I am trying to extract text from a PDF file using Python. My main goal is to create a program that reads a bank statement and extracts its text to update an Excel file, so I can easily record monthly spending. Right now I am focusing just…

python pdf python-3.7 pypdf pdf-extraction

asked Apr 19 '19 at 20:29

RaV1oLLi

votes

6 answers

How can I decrypt a PDF using PyPDF2?

Currently I am using the PyPDF2 as a dependency. I have encountered some encrypted files and handled them as you normally would (in the following code): from PyPDF2 import PdfReader reader = PdfReader(pdf_filepath) if reader.is_encrypted: …

python pdf encryption pypdf

asked Oct 07 '14 at 18:39

Jin Lee

votes

1 answer

Highlight text in a PDF with Python

I'm working on a custom search engine for my PDF data corpus. I have a transformation layer which is able to dump PDF content to text (using Apache Tika and GROBID). I have finished the search layers and the view which returns the search results listing. Now,…

python pdf search pypdf pdfminer

asked Oct 27 '16 at 15:18

Katharsis

votes

5 answers

Generate flattened PDF with Python

When I print a PDF from any of my source PDFs, the file size drops and the text boxes present in the form are removed. In short, it flattens the file. This is the behavior I want to achieve. The following code to create a PDF using another PDF as a source (the…

python pdf-generation reportlab pypdf

asked Nov 19 '14 at 17:21

MakeCents

votes

1 answer

How to extract text from a PDF file in Python?

How can I extract text from a PDF file in Python? I tried the following: import sys import pyPdf def convertPdf2String(path): content = "" pdf = pyPdf.PdfFileReader(file(path, "rb")) for i in range(0, pdf.getNumPages()): …

python pypdf

asked Mar 23 '13 at 04:57

lost
