181

In Python, how can I efficiently save a given page of a PDF as a JPEG file?

Use case: I have a Python Flask web server where PDFs are uploaded and a JPEG is stored for each page.

This solution is close, but the problem is that it does not convert the entire page to JPEG.

petezurich
vishvAs vAsuki
  • 2
    Depending on the image, it may be better to extract as a png. This would apply if the page contains mainly text. – Paul Rooney Jun 18 '20 at 05:54
  • Although generally true, the code using `fitz` that outputs PNG is substantially lower quality than the accepted one using JPG. I suspect the image resolutions are resized per PDF paper size. – Nelson Apr 06 '23 at 02:13

17 Answers

219

The pdf2image library can be used.

Install it with:

pip install pdf2image

Once installed, you can use the following code to get the images:

from pdf2image import convert_from_path
pages = convert_from_path('pdf_file', 500)

Save the pages in JPEG format:

for count, page in enumerate(pages):
    page.save(f'out{count}.jpg', 'JPEG')
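
Several comments below ask how to convert a single page instead of the whole document. A minimal sketch, assuming pdf2image's `first_page` and `last_page` arguments (1-based) that are mentioned in the comments; the function names `page_filename` and `save_page_as_jpeg`, the naming scheme, and the default of 200 dpi are illustrative choices, not part of the library:

```python
from pathlib import Path


def page_filename(pdf_path, page_number):
    """Return an output name such as 'report-page3.jpg' (hypothetical naming scheme)."""
    return f"{Path(pdf_path).stem}-page{page_number}.jpg"


def save_page_as_jpeg(pdf_path, page_number, out_dir=".", dpi=200):
    """Render one page (1-based) of a PDF and save it as a JPEG file."""
    # Imported here so the naming helper above works without pdf2image installed.
    from pdf2image import convert_from_path

    # Restricting first_page/last_page asks pdftoppm to render only that page.
    pages = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    target = Path(out_dir) / page_filename(pdf_path, page_number)
    pages[0].save(target, "JPEG")
    return target
```

Calling `save_page_as_jpeg("file.pdf", 1)` would write `file-page1.jpg` to the current directory. Whether this lowers memory use enough for very large PDFs is not settled in the comments.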

Edit: the pdf2image GitHub repo also mentions that it uses pdftoppm, which requires additional installations:

pdftoppm is the piece of software that does the actual magic. It is distributed as part of a greater package called poppler. Windows users will have to install poppler for Windows. Mac users will have to install poppler for Mac. Linux users will have pdftoppm pre-installed with the distro (Tested on Ubuntu and Archlinux) if it's not, run sudo apt install poppler-utils.

You can install the latest version under Windows using anaconda by doing:

conda install -c conda-forge poppler

Note: Windows versions up to 0.67 are available at http://blog.alivate.com.au/poppler-windows/ but 0.68 was released in Aug 2018, so you will not be getting the latest features or bug fixes.

Nelson
Keval Dave
  • 6
    Hi, the poppler is just a zipped file, doesn't install anything, what is one supposed to do with the dll's or the bin files ? – gaurwraith Aug 26 '18 at 21:59
  • @gaurwraith: Use the following [link to poppler](https://blog.alivate.com.au/poppler-windows/). For some reason the link in the description from Rodrigo is not the same as in the github repo. – Tobias Oct 09 '18 at 07:20
  • @Keval Dave Have you installed poppler and tried pdf2image on Windows machine? Which Windows please? – SKR Nov 27 '18 at 15:08
  • @SKR I have used this with windows 10 and 64bit machine. Find installation of poppler in windows from answer. – Keval Dave Nov 29 '18 at 09:56
  • This packages gives a white border to the image so removed it following this [stackoverflow question](https://stackoverflow.com/questions/10615901/trim-whitespace-using-pil?answertab=votes#tab-top) – hru_d May 06 '19 at 14:00
  • I've install it but got error: `jpeg8.dll` not found – Peter.k May 29 '19 at 11:20
  • I've pretty easily run out of memory doing this - anyone know of a way to just convert a single page (without loading the whole thing, then just using [0] or something)? – elPastor Jun 04 '19 at 23:16
  • 2
    @elPastor you can pass first_page and last_page as arguments to the convert_from_path function to convert only the specified pages – Keval Dave Jun 05 '19 at 09:57
  • Thanks for the heads up on those arguments, however I still get the same issue (I believe it's with memory, the traceback isn't helpful). I'm wondering if `first_page` / `last_page` still requires loading the full PDF into memory and then internally just parses out the required pages. – elPastor Jun 05 '19 at 10:22
  • Is the '500' the dpi? Just wondering what your reason for going to 500 dpi would be, it looks like 300 is the standard. – Sam Jul 25 '19 at 01:09
  • 1
    @Jacob 500 is the dpi. It tradeoff on the resolution required and the computation available. In my experiments, 500 worked well most of the cases while 300 got me low rez images. – Keval Dave Jul 25 '19 at 08:41
  • 3
    I used `conda install -c conda-forge poppler` to install poppler and it worked. – MNA Sep 18 '19 at 08:43
  • 3
    For converting the first page of the PDF and nothing else, this works:`from pdf2image import convert_from_path pages = convert_from_path('file.pdf', 500) pages = convert_from_path('file.pdf', 500, single_file=True) pages[0].save('file.jpg', 'JPEG')` – helgis Nov 12 '19 at 09:37
  • And there is a nice line in poppler docs: "You will then have to add the bin/ folder to PATH or use poppler_path = r"C:\path\to\poppler-xx\bin" as an argument in convert_from_path." thought in my case (conda install) it was actually C:\ProgramData\Anaconda3\pkgs\poppler-21.09.0-h24fffdf_1\Library\bin. – Rustam A. Oct 31 '21 at 21:21
  • If using mac, you can install both packages needed using conda `conda install poppler` `conda install pdf2image` – Emad Goahri Nov 06 '21 at 22:30
  • Get stuck on some pdf – Elia Weiss Apr 11 '22 at 09:19
  • 1
    Poppler's license is GPL based. Be careful in the commercial setting! – Shmack Sep 10 '22 at 18:46
  • This is probably the worst way if you are doing it for many pdfs. It stores images in ppms alongwith jpeg which itself are around 50 megabytes for each page of your pdf. It has a known issue of memory overhaul. – Saurabh Mahra Jul 24 '23 at 20:22
137

I found this simple solution using PyMuPDF, which outputs a PNG file. Note that the library is imported as "fitz", a historical name for the rendering engine it uses.

import fitz

pdffile = "infile.pdf"
doc = fitz.open(pdffile)
page = doc.load_page(0)  # zero-based page index
pix = page.get_pixmap()
output = "outfile.png"
pix.save(output)
doc.close()

Note: The library changed from using "camelCase" to "snake_cased". If you run into an error that a function does not exist, have a look under deprecated names. The functions in the example above have been updated accordingly.

The fitz.Document class supports a context manager initialization:

with fitz.open(pdffile) as doc:
   ...
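
Since the question asks for JPEG and some comments report low-quality output, here is a hedged sketch that renders at a chosen resolution and writes a JPEG through Pillow. By default `get_pixmap()` renders at 72 dpi, which may explain the quality complaints. The helper names `zoom_for_dpi` and `save_page_as_jpeg` and the default of 150 dpi are assumptions for illustration; Pillow is assumed to be installed.

```python
def zoom_for_dpi(dpi):
    """PDF user space has 72 units per inch, so 72 dpi corresponds to zoom 1.0."""
    return dpi / 72


def save_page_as_jpeg(pdf_path, page_index, out_path, dpi=150):
    """Render one page (0-based index) with PyMuPDF and save it as a JPEG via Pillow."""
    # Imported here so zoom_for_dpi can be used without PyMuPDF or Pillow installed.
    import fitz
    from PIL import Image

    zoom = zoom_for_dpi(dpi)
    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_index)
        # alpha=False yields plain RGB samples, which is what JPEG needs.
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
        # frombytes copies the pixel data, so the image outlives the document.
        image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    image.save(out_path, "JPEG")
```

For example, `save_page_as_jpeg("infile.pdf", 0, "outfile.jpg", dpi=300)` would render the first page at 300 dpi.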
cards
JJPty
  • 2
  • 2
    Please add explanation to your answer. – Shanteshwar Inde Apr 02 '19 at 17:31
  • 3
    A good library and it installs on Windows 10 without problems (no wheels required). https://github.com/pymupdf – Comrade Che Jan 23 '20 at 09:27
  • 30
    This is the BEST answer. This was the only code that didn't require an additional installation onto my OS. Python scripts should focus on working within the Python system. I did not need to install poppler, pdftoppm, imageMagick or ghostscript, etc. (Python 3.6) – ZStoneDPM Feb 04 '20 at 22:11
  • 6
    Actually it requires another installation (fitz library, imported without even being referred to and its dependencies), this answer is incomplete (like all of the answers at this question) – Tommaso Guerrini Feb 06 '20 at 12:36
  • 1
    @TommasoGuerrini no. From the docs: "The standard Python import statement for this library is import fitz. This has a historical reason..." is another library, something about neuroimaging. The code works as expected. – TEH EMPRAH Feb 18 '20 at 08:49
  • 1
    @JJPty Instead of pdf file taken from the path, can we take from pdfurl? Also, is it possible for the png file to be in-stream data rather than output-png file? – Shubham Agrawal Mar 04 '20 at 06:23
  • 7
    `image = page.getPixmap(matrix=fitz.Matrix(150/72,150/72))` extracts the image at 150 DPI. [Issue question on this topic.](https://github.com/pymupdf/PyMuPDF/issues/181) – Josiah Yoder Jul 20 '20 at 21:21
  • 7
    This solution uses code licensed commercially by Artifex Software, as well as open-source by AGPL licensing. Be wary of using this on your project, especially if it's commercial in nature. You may need to dig deeper into the legal implications. – Milo Persic Mar 07 '21 at 18:44
  • The perfect solution: it needs no dependencies. No Poppler, nothing else. – Zain Ul Abidin Apr 02 '22 at 12:16
  • for jpeg, I used `pil_save` instead of `save` – RAbraham Jan 09 '23 at 19:01
  • You saved my life! [tears of joy], tried almost a thousand libraries (`wand`, `svglib`, `cairosvg`, `pdf2image`, `pdf2files`, etc.) Each one needed another program to run, download exe on Windows, sudo on Linux, add to path... But this one is magic!!! you can even use `page.get_pixmap(dpi=300)` to get a 5921×1734 PNG file!!! I'm in love with this . – Ali Abdi Mar 07 '23 at 07:34
  • Although code works, for some reason the extracted image is much lower quality than the original. I can't find anything obvious that would cause the quality degradation. – Nelson Apr 06 '23 at 02:11
37

Using pypdfium2 (v4):

python3 -m pip install "pypdfium2==4" pillow

import pypdfium2 as pdfium

# Load a document
filepath = "tests/resources/multipage.pdf"
pdf = pdfium.PdfDocument(filepath)

# render a single page (in this case: the first one)
page = pdf[0]
pil_image = page.render(scale=4).to_pil()
pil_image.save("output.jpg")

# render multiple pages concurrently (in this case: all)
page_indices = [i for i in range(len(pdf))]
renderer = pdf.render(pdfium.PdfBitmap.to_pil, page_indices=page_indices)
for index, image in zip(page_indices, renderer):
    image.save("output_%02d.jpg" % index)
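
If the concurrent renderer cannot start a process pool in your environment (one comment below reports such an error), the pages can also be rendered one at a time. This is a sketch under the assumption that the pypdfium2 v4 API shown above is in use; `scale_for_dpi`, `save_pages_as_jpeg`, and the default of 200 dpi are hypothetical names and values chosen for illustration.

```python
def scale_for_dpi(dpi):
    """A scale of 1 corresponds to 72 dpi, so scale=4 is 288 dpi."""
    return dpi / 72


def save_pages_as_jpeg(pdf_path, out_pattern="output_%02d.jpg", dpi=200):
    """Render every page sequentially and save each one as a JPEG file."""
    # Imported here so scale_for_dpi can be used without pypdfium2 installed.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_path)
    try:
        for index in range(len(pdf)):
            page = pdf[index]
            image = page.render(scale=scale_for_dpi(dpi)).to_pil()
            # JPEG has no alpha channel, so make sure the image is plain RGB.
            image.convert("RGB").save(out_pattern % index, "JPEG")
            # Close the page before the document, in reverse order of loading.
            page.close()
    finally:
        pdf.close()
```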

Advantages:

  • PDFium is liberal-licensed (BSD 3-Clause or Apache 2.0, at your choice)
  • It is fast, outperforming Poppler. In terms of speed, pypdfium2 can almost reach PyMuPDF
  • Returns PIL.Image.Image, numpy.ndarray, or a ctypes array, depending on your needs
  • Is capable of processing encrypted (password-protected) PDFs
  • No mandatory runtime dependencies
  • Supports Python >= 3.6
  • Setup infrastructure complies with PEP 517/518

Wheels are currently available for

  • Windows amd64, win32, arm64
  • macOS x86_64, arm64
  • Linux (glibc 2.26+) x86_64, i686, aarch64, armv7l
  • Linux (musl 1.2+) x86_64, i686

There is a script to build from source, too.

(Disclaimer: I'm the author)

mara004
  • 6
    This is the solution that worked best for me since it didn't require any other installation on python 3.9.13 and windows 10. You should add how to import pdfium in your reply: import pypdfium2 as pdfium – Francesco Pettini Jul 25 '22 at 09:49
  • 1
    Added, thanks! I believe it initially was part of the post but might have got lost during an edit. (I updated this reply several times due to API changes.) – mara004 Jul 25 '22 at 10:46
  • @FrancescoPettini AFAIK, pymupdf doesn't require any external dependencies, either. Technically, it's yet a bit better than pypdfium2, so if you don't mind the AGPL, you could give that one a try, too. – mara004 Jul 25 '22 at 11:01
  • installing pymupdf via fitz required me to install frontend, which if I remember correctly required other packages too – Francesco Pettini Jul 25 '22 at 11:06
  • @FrancescoPettini The docs say that pymupdf doesn't have any mandatory external runtime dependencies if installing from the binary wheels. – mara004 Jul 27 '22 at 12:04
  • Trying to use the multi-page render here nets me a "An attempt has been made to start a new process before the current process has finished its bootstrapping phase." – user3896248 Aug 04 '22 at 23:52
  • @user3896248 It looks like you may be calling the function in a special context where it is not possible to set up a new process pool. Consider using the single page renderer, or file a more detailed bug report on GitHub. – mara004 Aug 16 '22 at 18:07
  • 8
    This should be the accepted answer, thanks for your work. No need of any extra installation, `pip install pypdfium2` is enough. – Tim Aug 25 '22 at 13:09
  • This works great, but when using pyinstaller to create an exe, when I run the exe, it can't find "pdfium", which is referring to pypdfium2 (I checked the line that threw the error). Any idea as to how to fix this? – Shmack Sep 09 '22 at 19:25
  • @Shmack pypdfium2 contains a binary extension, and you need to configure pyinstaller to take that along. The pyinstaller docs provide information on how to do this. I never used pyinstaller myself but had a similar issue report once and the user was able to fix it somehow (https://github.com/pypdfium2-team/pypdfium2/issues/120). – mara004 Sep 10 '22 at 13:21
  • @mara004 Yes. I actually figured this out a few hours after I posted this... `--collect-all pypdfium2` as a cmd line option should work, but I settled for `--add-data "C:\Program Files\Python39\Lib\site-packages\pypdfium2\pdfium.dll";.` (The "." at the end is intentional). – Shmack Sep 10 '22 at 17:23
  • @Shmack Great, thanks for letting me know! Actually, I'm intrigued why pyinstaller doesn't automatically include the binary. After all, the file wouldn't be in the package directory if it wasn't needed. – mara004 Sep 10 '22 at 18:33
  • @mara004 That's what I was saying!!! The only reason I stumbled into the answer was because I went searching into Lib\site-packages for "pdfium" because I wasn't importing any libraries called pdfium. I thought that if it was a dependency then I'd see it in site-packages. Low and behold it wasn't, so I thought I'd explore pypdfium2's folder... and what do you know... pdfium.dll. Soooo annoying. – Shmack Sep 10 '22 at 18:42
  • @Shmack Sorry for the inconveniences. I'm wondering if there's anything I can do to improve the situation. pypdfium2's setup code is a bit non-standard because setuptools extensions don't work for external binaries, they're only meant for in-place compilation. That's why we currently have to camouflage binaries as package data. Maybe pyinstaller would work correctly if it was an official extension, but I feel like package data should be included in any case... – mara004 Sep 10 '22 at 19:11
  • Oh, no - you're definitely right. That dll should've been included with pyinstaller - so I mean its not your fault. I can't think of a practical way on your end to alert users that are trying to include the library in pyinstaller (or of the likes) that they'd have to set that flag. I think the best you could get is to include it in the readme - but thats not worth IMO creating another branch in github. One more note. I noticed that when I open a pdf, and get a page with `get_page()` after `pdf.close()`, it doesn't close the page. So a move operation throws an error, because the page is in use. – Shmack Sep 10 '22 at 19:19
  • @Shmack Adding a note to the readme is a good idea, I'll do that. Concerning the problem you mention, I'm afraid I don't quite understand yet. Once you have called `pdf.close()`, no resources associated to that document handle may be accessed anymore, including loaded pages. Objects need to be closed in reverse order compared to loading (i. e. first the page, then the pdf). I'm not sure if I understood your problem correctly, though. If this information isn't sufficient, could you file a bug report on GitHub to elaborate? – mara004 Sep 10 '22 at 21:23
  • No, then I had false assumptions. I figured that when you `close()`d a `PdfDocument()` it'd kill its children, but maybe those instances aren't tied to the `PdfDocument()` at all? IDK, I remember trying to delete the document then the page, but I could've sworn that I did switched it and tried it the other way. But that's neither here nor there, since I got it working. Not that its a huge deal, but maybe in a future release, consider keeping the reference to the children and `close()` them when the parent pdf is `close()`d. Real quick, do you keep vector data on rendering a pdf or rasterize? – Shmack Sep 11 '22 at 03:51
  • May I add a snippet --- you may need to install PILLOW for pypdfium2 to work properly as described in the code example above. I certainly had to. – lb_so Aug 07 '23 at 04:41
  • @lb_so Added, thanks! – mara004 Aug 07 '23 at 15:15
30

The Python library pdf2image (used in the other answer) in fact doesn't do much more than launch pdftoppm with subprocess.Popen, so here is a short version that does it directly:

import subprocess

PDFTOPPMPATH = r"D:\Documents\software\____PORTABLE\poppler-0.51\bin\pdftoppm.exe"
PDFFILE = "SKM_28718052212190.pdf"

# writes one PNG per page, using "out" as the file name prefix;
# check=True raises an error if pdftoppm fails
subprocess.run([PDFTOPPMPATH, "-png", PDFFILE, "out"], check=True)

Here is the Windows installation link for pdftoppm (contained in a package named poppler): http://blog.alivate.com.au/poppler-windows/.
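
To get JPEG output for a single page, as the question asks, the same call can use pdftoppm's `-jpeg`, `-r`, `-f` and `-l` options. The following is a minimal sketch; the helper names `pdftoppm_command` and `convert`, and the default of 150 dpi, are assumptions made for illustration.

```python
import subprocess


def pdftoppm_command(pdf_path, out_prefix, page=None, dpi=150, pdftoppm="pdftoppm"):
    """Build the pdftoppm argument list for JPEG output."""
    command = [pdftoppm, "-jpeg", "-r", str(dpi)]
    if page is not None:
        # -f and -l select the first and last page to render (1-based).
        command += ["-f", str(page), "-l", str(page)]
    command += [pdf_path, out_prefix]
    return command


def convert(pdf_path, out_prefix, **options):
    """Run pdftoppm and raise CalledProcessError if it fails."""
    subprocess.run(pdftoppm_command(pdf_path, out_prefix, **options), check=True)
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces. On Windows, the full path to pdftoppm.exe can be given through the `pdftoppm` parameter.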

Basj
  • 4
    Hi, the Windows installation link for pdftoppm is just a bunch of zipped files. What do you have to do with them to make them work? Thanks! – gaurwraith Aug 27 '18 at 11:05
17

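There is no need to install Poppler on your OS; Wand uses ImageMagick instead (which, as the comments note, must be installed and may also need Ghostscript for PDFs). This will work: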

pip install Wand

from wand.image import Image

f = "somefile.pdf"
with Image(filename=f, resolution=120) as source:
    for i, image in enumerate(source.sequence):
        newfilename = f.removesuffix(".pdf") + str(i + 1) + '.jpeg'
        Image(image).save(filename=newfilename)
cards
DevB2F
  • 19
    [ImageMagick library](http://docs.wand-py.org/en/latest/guide/install.html#install-imagemagick-on-windows) needs to be installed to work on wand. – Neeraj Gulia Mar 13 '19 at 12:32
  • 4
    I tried this and needed to install Ghostscript as well (using Windows 10 and Python 3.7). Did it and it worked perfectly. – jcf Jul 01 '19 at 07:55
  • 1
    whats the f[:-4] for? its not referenced anywhere else – Ari Sep 14 '19 at 23:27
  • 1
    @Ari f[:-4] will cut of ".pdf" from filename ( string slicing ) to create new filename with other ext. – Fabian Nov 01 '19 at 19:10
13

@gaurwraith, install poppler for Windows and use pdftoppm.exe as follows:

  1. Download zip file with Poppler's latest binaries/dlls from http://blog.alivate.com.au/poppler-windows/ and unzip to a new folder in your program files folder. For example: "C:\Program Files (x86)\Poppler".

  2. Add "C:\Program Files (x86)\Poppler\poppler-0.68.0\bin" to your SYSTEM PATH environment variable.

  3. From cmd line install pdf2image module -> "pip install pdf2image".

  4. Or alternatively, directly execute pdftoppm.exe from your code using Python's subprocess module as explained by user Basj.

@vishvAs vAsuki, this code should generate the jpgs you want through the subprocess module for all pages of one or more pdfs in a given folder:

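import os, subprocess

pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)

pdftoppm_path = r"C:\Program Files (x86)\Poppler\poppler-0.68.0\bin\pdftoppm.exe"

for pdf_file in os.listdir(pdf_dir):

    if pdf_file.endswith(".pdf"):

        # use the PDF's own name as the output prefix so that several PDFs
        # do not overwrite each other's images
        out_prefix = os.path.splitext(pdf_file)[0]
        # a list of arguments also handles file names that contain spaces
        subprocess.run([pdftoppm_path, "-jpeg", pdf_file, out_prefix], check=True)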

Or using the pdf2image module:

import os
from pdf2image import convert_from_path

pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)

for pdf_file in os.listdir(pdf_dir):

    if pdf_file.endswith(".pdf"):

        pages = convert_from_path(pdf_file, 300)
        pdf_name = pdf_file[:-4]

        # enumerate gives each page its index without searching the list
        for page_number, page in enumerate(pages):

            page.save("%s-page%d.jpg" % (pdf_name, page_number), "JPEG")
photek1944
8

Ghostscript performs much faster than Poppler on Linux-based systems.

The following code converts one page of a PDF to an image:

import subprocess

RESOLUTION = 150  # dpi; assumed value, the original snippet did not define it

def get_image_page(pdf_file, out_file, page_num):
    page = str(page_num + 1)  # Ghostscript page numbers start at 1
    command = ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m", "-r" + str(RESOLUTION), "-dPDFFitPage",
               "-sOutputFile=" + out_file, "-dFirstPage=" + page, "-dLastPage=" + page,
               pdf_file]
    # discard Ghostscript's console output
    subprocess.call(command, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)

Ghostscript can be installed on macOS using brew install ghostscript.

If it is not already installed on your system, installation information for other platforms can be found here.
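
The function above writes PNG files. For JPEG output, as asked in the question, Ghostscript's `jpeg` device with the `-dJPEGQ` quality option can be used instead. A minimal sketch; the helper name `gs_jpeg_command` and the default resolution and quality values are assumptions for illustration, and on Windows the executable is typically named differently from `gs`.

```python
import subprocess


def gs_jpeg_command(pdf_file, out_file, page_num, resolution=150, quality=90):
    """Build a Ghostscript command that renders one page (0-based) as a JPEG."""
    page = str(page_num + 1)  # Ghostscript page numbers start at 1
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=jpeg", f"-dJPEGQ={quality}", f"-r{resolution}",
        f"-sOutputFile={out_file}",
        f"-dFirstPage={page}", f"-dLastPage={page}",
        pdf_file,
    ]


def get_jpeg_page(pdf_file, out_file, page_num, **options):
    """Run Ghostscript and raise CalledProcessError if it fails."""
    subprocess.run(gs_jpeg_command(pdf_file, out_file, page_num, **options), check=True)
```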

Keval Dave
  • 2
    Just to let everyone know, Ghostscript is based on AGPL License and might need permissions in case used within commercial projects. For more reference, read https://www.ghostscript.com/license.html. – Abhishek Jain Jul 06 '21 at 18:27
  • How do you get to the conclusion that Ghostscript is "much faster" than Poppler? I can't reproduce this observation in my personal benchmarks. In fact, I found Ghostscript to be slightly slower. – mara004 Apr 14 '22 at 14:19
5

There is a utility called pdf2jpg that can be used to convert a PDF to images.

You can find the code here: https://github.com/pankajr141/pdf2jpg

from pdf2jpg import pdf2jpg
inputpath = r"D:\inputdir\pdf1.pdf"
outputpath = r"D:\outputdir"
# To convert single page
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="1")
print(result)

# To convert multiple pages
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="1,0,3")
print(result)

# to convert all pages
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="ALL")
print(result)
duck
4

One problem everyone will face is installing Poppler. My way is a bit of a workaround, but it works well.

First, download Poppler here.

Then extract it and pass poppler_path=r'C:\Program Files\poppler-0.68.0\bin' (for example) to convert_from_path, as below:

from pdf2image import convert_from_path
images = convert_from_path("mypdf.pdf", 500,poppler_path=r'C:\Program Files\poppler-0.68.0\bin')
for i, image in enumerate(images):
    fname = 'image'+str(i)+'.png'
    image.save(fname, "PNG")
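
If the same script should also run on machines where Poppler is already on the PATH, the hard-coded folder can be used only as a fallback. This is a sketch; `resolve_poppler_path` is a hypothetical helper, and it relies on pdf2image searching the PATH when `poppler_path` is `None`, which is its default.

```python
import shutil


def resolve_poppler_path(fallback_dir, which=shutil.which):
    """Return None when pdftoppm is on PATH, otherwise the fallback bin folder."""
    # The `which` parameter exists only so the lookup can be replaced in tests.
    if which("pdftoppm") is not None:
        return None
    return fallback_dir
```

It would be used as `convert_from_path("mypdf.pdf", 500, poppler_path=resolve_poppler_path(r'C:\Program Files\poppler-0.68.0\bin'))`.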
petezurich
Rajkumar
  • This will produce an image per page with the i argument. It works really well. Thank you! – Harry Jan 08 '21 at 15:45
3

Here is a function that does the conversion of a PDF file with one or multiple pages to a single merged JPEG image.

import tempfile
from pdf2image import convert_from_path
from PIL import Image

def convert_pdf_to_image(file_path, output_path):
    # pdf2image writes its intermediate page files to temp_dir; everything that
    # reads them must run before the directory is deleted
    with tempfile.TemporaryDirectory() as temp_dir:
        # convert the PDF to one image per page
        imgs = convert_from_path(file_path, output_folder=temp_dir)
        # find the minimum width of the images (wider pages are cropped to it)
        min_img_width = min(img.width for img in imgs)
        # find the total height of all images
        total_height = sum(img.height for img in imgs)
        # create a new image object with that width and the total height
        merged_image = Image.new(imgs[0].mode, (min_img_width, total_height))
        # paste the images together one below the other
        y = 0
        for img in imgs:
            merged_image.paste(img, (0, y))
            y += img.height
    # save the merged image
    merged_image.save(output_path)
    return output_path

Example usage:

convert_pdf_to_image("path_to_Pdf/1.pdf", "output_path/output.jpeg")

dpacman
  • Just curious, why `for i, img in enumerate(imgs): total_height += imgs[i].height` instead of simply `for img in imgs: total_height += img.height` ? – Vladimir Prudnikov Jul 05 '21 at 09:55
2

I wrote this script to convert a folder of single-page PDFs to PNGs.

from pathlib import Path
from pdf2image import convert_from_path

# collect the PDFs in the current working directory
pdf_files = sorted(Path.cwd().glob('*.pdf'))

# convert the first (only) page of each PDF to a PNG with the same base name
for pdf_path in pdf_files:

    images = convert_from_path(str(pdf_path), dpi=300)

    images[0].save(f'{pdf_path.stem}.png', 'PNG')

Hopefully, this helps if you need to convert PDFs to PNGs!

1

I use a (maybe) much simpler option than pdf2image, calling pdftoppm directly:

cd "$dir"
for f in *.pdf
do
  if [ -f "${f}" ]; then
    n="${f%.pdf}"
    pdftoppm -scale-to 1440 -png "$f" "$conv/$n"
    rm "$f"
    mv "$conv"/*.png "$dir"
  fi
done

This is a small part of a bash script that runs in a loop on a narrowcasting device. It checks every 5 seconds for newly added PDF files and processes all of them. This is for a demo device; in the end, the conversion will be done on a remote server. It converts to PNG for now, but JPG is possible too.

This conversion, together with transitions on A4 format, displaying a video, two smooth scrolling texts and a logo (with transition in three versions), drives the Pi3 to almost 4x 100% CPU load ;-)

Robert
-1
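from pdf2image import convert_from_path
import glob
import os

pdf_dir = glob.glob(r'G:\personal\pdf\*.pdf')  # your PDF folder path
img_dir = "G:\\personal\\img\\"               # your destination image path

for pdf_ in pdf_dir:
    pages = convert_from_path(pdf_, 500)
    # base name of the PDF without its extension
    name = os.path.splitext(os.path.basename(pdf_))[0]
    for number, page in enumerate(pages, start=1):
        # number the pages so that they do not overwrite each other
        page.save(os.path.join(img_dir, f"{name}_{number}.jpg"), 'JPEG')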
Ari
  • This would be a better answer if you explained how the code you provided answers the question. – pppery Sep 15 '19 at 00:39
  • 2
    @pppery Python is fairly readable, the comments do indicate the source folder and output folder, the rest reads like english. – Ari Sep 15 '19 at 10:36
-1

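Here is a solution that requires no additional libraries and is very fast. Note that it extracts the first JPEG image embedded in the PDF rather than rendering a page, so it only helps when the page content is stored as a JPEG. It is adapted from https://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html# and I have wrapped the code in a function to make it more convenient.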

def convert(filepath):
    with open(filepath, "rb") as file:
        pdf = file.read()

    startmark = b"\xff\xd8"
    startfix = 0
    endmark = b"\xff\xd9"
    endfix = 2
    i = 0

    njpg = 0
    while True:
        istream = pdf.find(b"stream", i)
        if istream < 0:
            break
        istart = pdf.find(startmark, istream, istream + 20)
        if istart < 0:
            i = istream + 20
            continue
        iend = pdf.find(b"endstream", istart)
        if iend < 0:
            raise Exception("Didn't find end of stream!")
        iend = pdf.find(endmark, iend - 20)
        if iend < 0:
            raise Exception("Didn't find end of JPG!")

        istart += startfix
        iend += endfix
        jpg = pdf[istart:iend]
        newfile = "{}jpg".format(filepath[:-3])
        with open(newfile, "wb") as jpgfile:
            jpgfile.write(jpg)

        njpg += 1
        i = iend

        return newfile

Call convert with the PDF path as the argument, and the function will create a .jpg file in the same directory.

moo5e
  • 5
    This technique looks like it extracts images that have been embedded in the file, rather than rasterizing a page of the file as an image which is what the questioner wanted. – Josh Gallagher Mar 20 '20 at 16:43
-1

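For a PDF file with multiple pages, the following is a simple approach (I used pdf2image-1.14.0):

from pdf2image import convert_from_path
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError,
)

try:
    # optional arguments include dpi=200, grayscale=True, size=(300, 400),
    # first_page=1 and last_page=3 (page numbers start at 1)
    images = convert_from_path(
        r"path/to/input/pdf/file",
        output_folder=r"path/to/output/folder",
        fmt="jpg",
    )
    # the files are already on disk, so the in-memory list can be emptied
    images.clear()
except (PDFInfoNotInstalledError, PDFPageCountError, PDFSyntaxError) as error:
    print(f"Conversion failed: {error}")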

Note:

  1. "images" is a list of PIL images.
  2. The saved images in the output folder will have system generated names; one can later change them, if required.
SKG
  • 2
    Why is this "the best" ? – Nik O'Lai Mar 25 '21 at 18:41
  • 1) Fast as, no loop is required. 2) All the required parameters (like dpi, format, grayscale option, size etc.) are processed at one run. 3) Built-in exception handling is there. 4) The core function calling is only a single line statement. 5) You can get images as 'saved' files as well as a 'list' of 'matrices'. – SKG Mar 26 '21 at 12:34
-1

This script converts every PDF (single or multiple pages) in a folder to JPEG files.

import os
import shutil
from pdf2image import convert_from_path


def move_processed_file(file, doc_path, download_processed):
    # make sure the destination folder exists before moving the file
    os.makedirs(download_processed, exist_ok=True)
    try:
        shutil.move(os.path.join(doc_path, file), os.path.join(download_processed, file))
    except OSError as e:
        print(e.errno)
        raise


def run_conversion():
    root_dir = os.path.abspath(os.curdir)

    doc_path = os.path.join(root_dir, "data", "download")
    pdf_processed = os.path.join(doc_path, "pdf_processed")
    results_folder = doc_path

    pdf_files = [f for f in os.listdir(doc_path)
                 if os.path.isfile(os.path.join(doc_path, f)) and f.lower().endswith('.pdf')]

    # check OS type
    if os.name == 'nt':
        # on Windows, change this Poppler path to your own path
        poppler_path = r"C:\poppler-0.68.0\bin"
    else:
        # on other systems, let pdf2image find Poppler on the PATH
        poppler_path = None

    for file in pdf_files:

        # Store all the pages of the PDF in a variable
        pages = convert_from_path(os.path.join(doc_path, file), 500, poppler_path=poppler_path)

        basename, file_extension = os.path.splitext(file)

        # Save each page as a JPG: PDF page n -> <name>_n.jpg
        for image_counter, page in enumerate(pages, start=1):
            # build the name from the unchanged base name on every iteration
            filename = basename + '_' + str(image_counter) + ".jpg"

            # Save the image of the page
            page.save(os.path.join(results_folder, filename), 'JPEG')

        move_processed_file(file, doc_path, pdf_processed)


Malki Mohamed
-3
from pdf2image import convert_from_path

PDF_file = 'Statement.pdf'
pages = convert_from_path(PDF_file, 500,userpw='XXX')

image_counter = 1

for page in pages:

    filename = "foldername/page_" + str(image_counter) + ".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1
Vito Gentile
  • 4
    Posting a poorly formatted, incorrectly indented answer with no explanation as to how your answer works or what benefits it offers compared to the 13 existing answers, is of very little value as it stands. Please [edit] your answer, fix the formatting (the [formatting help](https://stackoverflow.com/editing-help) can assist you), fix the indentation, and add some explanation. – David Buck Apr 14 '21 at 06:15