Questions tagged [pypdf]

pypdf is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. It can retrieve text and metadata from PDFs as well as merge entire files.

pypdf is a pure-Python library built as a PDF toolkit. It is capable of:

  • extracting document information (title, author, ...),
  • splitting documents page by page,
  • merging documents page by page,
  • cropping pages,
  • merging multiple pages into a single page,
  • encrypting and decrypting PDF files.

Because it is pure Python, it should run on any Python platform without depending on external libraries. It can also work entirely on in-memory file-like objects (StringIO in the original Python 2 library, io.BytesIO on Python 3) rather than files on disk, allowing for PDF manipulation in memory. It is therefore a useful tool for websites that manage or manipulate PDFs.

pypdf was inactive from 2010 to 2022. Active maintenance resumed in December 2022.

Relationship to PyPDF2

PyPDF2 was a fork of pyPdf.

PyPDF2 received many updates in 2022, but it has since been deprecated in favor of pypdf.

pypdf==3.1.0 is essentially the same as PyPDF2==3.0.0; only the package name was changed to pypdf.
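
Since the two packages differ mainly in name, code that has to run in environments with either one may want to detect which is importable. The sketch below uses only the standard library; the function name `available_pdf_package` is hypothetical, and the preference order (pypdf first) is an assumption based on PyPDF2 being deprecated.

```python
from importlib.util import find_spec


def available_pdf_package():
    """Return the first importable package name, preferring pypdf, or None."""
    for name in ("pypdf", "PyPDF2"):
        if find_spec(name) is not None:
            return name
    return None


print(available_pdf_package())
```

For code that targets version 3.x of either package, migrating is then mostly a matter of changing `from PyPDF2 import ...` to `from pypdf import ...`.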

See: https://pypdf.readthedocs.io/en/latest/meta/history.html


1451 questions
270
votes
15 answers

Merge PDF files

Is it possible, using Python, to merge separate PDF files? Assuming so, I need to extend this a little further. I am hoping to loop through folders in a directory and repeat this procedure. And I may be pushing my luck, but is it possible to…
Btibert3
  • 38,798
  • 44
  • 129
  • 168
115
votes
24 answers

Extract images from PDF without resampling, in python?

How might one extract all images from a PDF document, at native resolution and format? (Meaning extract TIFF as TIFF, JPEG as JPEG, etc., and without resampling.) Layout is unimportant; I don't care where the source image is located on the page.
matt wilkie
  • 17,268
  • 24
  • 80
  • 115
43
votes
6 answers

How can I remove a URL channel from Anaconda?

Recently I needed to install PyPdf2 for one of my programs using Anaconda. Unfortunately, I failed, but the URLs that were added to the Anaconda environment prohibit updates of all the Conda libraries. Every time I tried to update Anaconda it gives…
Mohammad ElNesr
  • 2,477
  • 4
  • 27
  • 44
36
votes
12 answers

How to check if PDF is scanned image or contains text

I have a large number of files, some of them are scanned images into PDF and some are full/partial text PDF. Is there a way to check these files to ensure that we are only processing files which are scanned images and not those that are full/partial…
Jinu Joseph
  • 542
  • 1
  • 4
  • 17
35
votes
5 answers

pypdf Merging multiple pdf files into one pdf

If I have 1000+ pdf files that need to be merged into one pdf, from PyPDF2 import PdfReader, PdfWriter writer = PdfWriter() for i in range(1000): filepath = f"my/pdfs/{i}.pdf" reader = PdfReader(open(filepath, "rb")) for page in…
daydaysay
  • 361
  • 1
  • 3
  • 6
30
votes
8 answers

Unable to use pypdf module

I have installed the pyPdf module successfully using the command pip install pydf but when I use the module using the import command I get the following error: enC:\Anaconda3\lib\site-packages\pyPdf\__init__.py in () 1 from pdf import…
Nitin Vijay
  • 435
  • 1
  • 8
  • 14
30
votes
3 answers

How to read line by line in pdf file using PyPdf?

I have some code to read from a pdf file. Is there a way to read line by line from the pdf file (not pages) using Pypdf, Python 2.6, on Windows? Here is the code for reading the pdf pages: import pyPdf def getPDFContent(path): content = "" …
Rami Jarrar
  • 4,523
  • 7
  • 36
  • 52
30
votes
7 answers

Retrieve Custom page labels from document with pyPdf

At the moment I'm looking into doing some PDF merging with pyPdf, but sometimes the inputs are not in the right order, so I'm looking into scraping each page for its page number to determine the order it should go in (e.g. if someone split up a book…
SquidneyPoitier
  • 413
  • 1
  • 4
  • 6
27
votes
5 answers

Xref table not zero-indexed. ID numbers for objects will be corrected. won't continue

I am trying to open a pdf to get the number of pages. I am using PyPDF2. Here is my code: def pdfPageReader(file_name): try: reader = PyPDF2.PdfReader(file_name, strict=True) number_of_pages = len(reader.pages) …
JBin
  • 471
  • 1
  • 6
  • 18
27
votes
7 answers

Cropping pages of a .pdf file

I was wondering if anyone had any experience in working programmatically with .pdf files. I have a .pdf file and I need to crop every page down to a certain size. After a quick Google search I found the pyPdf library for python but my experiments…
johannth
  • 291
  • 1
  • 3
  • 5
22
votes
10 answers

How to extract text from pdf in Python 3.7

I am trying to extract text from a PDF file using Python. My main goal is I am trying to create a program that reads a bank statement and extracts its text to update an excel file to easily record monthly spendings. Right now I am focusing just…
RaV1oLLi
  • 529
  • 1
  • 3
  • 9
21
votes
6 answers

How can I decrypt a PDF using PyPDF2?

Currently I am using the PyPDF2 as a dependency. I have encountered some encrypted files and handled them as you normally would (in the following code): from PyPDF2 import PdfReader reader = PdfReader(pdf_filepath) if reader.is_encrypted: …
Jin Lee
  • 361
  • 1
  • 2
  • 10
20
votes
1 answer

Highlight text in a PDF with Python

I'm working on a custom search engine for my PDF data corpus. I have a transformation layer which is able to dump PDF content to text (using Apache Tika and GROBID). I have finished the search layers and the view which returns the search results listing. Now,…
Katharsis
  • 239
  • 1
  • 2
  • 8
19
votes
5 answers

Generate flattened PDF with Python

When I print a PDF from any of my source PDFs, the file size drops and the text boxes present in the form are removed. In short, it flattens the file. This is the behavior I want to achieve. The following code to create a PDF using another PDF as a source (the…
MakeCents
  • 742
  • 1
  • 5
  • 15
19
votes
1 answer

How to extract text from a PDF file in Python?

How can I extract text from a PDF file in Python? I tried the following: import sys import pyPdf def convertPdf2String(path): content = "" pdf = pyPdf.PdfFileReader(file(path, "rb")) for i in range(0, pdf.getNumPages()): …
lost
  • 211
  • 1
  • 2
  • 9