4

I am trying to convert a large number of PDFs (10k+) to JPG images and extract text from them. I am currently using the pdf2image Python library, but it is rather slow. Is there any faster library than this?

from pdf2image import convert_from_bytes
images = convert_from_bytes(open(path,"rb").read())

Note: I am using Ubuntu 18.04
CPU: 4 cores / 8 threads (Ryzen 3 3100)
Memory: 8 GB

Sahil Lohiya

3 Answers

6

pyvips is a bit quicker than pdf2image. I made a tiny benchmark:

#!/usr/bin/python3

import sys
from pdf2image import convert_from_bytes

# render every page of the PDF given on the command line and save each as a JPEG
images = convert_from_bytes(open(sys.argv[1], "rb").read())
for i in range(len(images)):
    images[i].save(f"page-{i}.jpg")

With this test document I see:

$ /usr/bin/time -f %M:%e ./pdf.py nipguide.pdf 
1991624:4.80

So 2GB of memory and 4.8s of elapsed time.

You could write this in pyvips as:

#!/usr/bin/python3

import sys
import pyvips

filename = sys.argv[1]

# open once to read the page count, then render each page in turn
image = pyvips.Image.new_from_file(filename)
for i in range(image.get('n-pages')):
    page = pyvips.Image.new_from_file(filename, page=i)
    page.write_to_file(f"page-{i}.jpg")

I see:

$ /usr/bin/time -f %M:%e ./vpdf.py nipguide.pdf[dpi=200]
676436:2.57

670MB of memory and 2.6s elapsed time.

They are both using poppler behind the scenes, but pyvips calls directly into the library rather than using processes and temp files, and can overlap load and save.

You can configure pyvips to use pdfium rather than poppler, though it's a bit more work, since pdfium is still not packaged by many distributions. pdfium can be perhaps 3x faster than poppler for some PDFs.

You can use multiprocessing to get a further speedup. This will work better with pyvips because of the lower memory use, and the fact that it's not using huge temp files.

If I modify the pyvips code to render only a single page, I can use GNU parallel to render each page in a separate process:

$ time parallel ../vpdf.py us-public-health-and-welfare-code.pdf[dpi=150] ::: {1..100}
real    0m1.846s
user    0m38.200s
sys 0m6.371s

So 100 pages at 150dpi in 1.8s.
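
For reference, the single-page variant could look something like this -- a sketch only, assuming parallel appends the page number as the second command-line argument and that pages are numbered from 1:

#!/usr/bin/python3

import sys
import pyvips

# render just the one page asked for, so GNU parallel can run one process per page
filename = sys.argv[1]          # e.g. us-public-health-and-welfare-code.pdf[dpi=150]
page_number = int(sys.argv[2])  # 1-based page number supplied by parallel

page = pyvips.Image.new_from_file(filename, page=page_number - 1)
page.write_to_file(f"page-{page_number}.jpg")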

jcupitt
  • pyvips is very interesting, thanks for mentioning it. However, in my experience, pdfium is generally *considerably* faster than poppler at rendering (though it may vary depending on the PDF). And note, if you only want to use pdfium, that there's also pypdfium2 (disclaimer: I'm the author, but I might not have started the project had I known about pyvips by that time ;) ). – mara004 May 23 '23 at 20:51
  • Oh, interesting. I've not found a document where pdfium is significantly quicker, but perhaps I've been unlucky. pdfium has a much more liberal license, and I think that's the area where it really wins. – jcupitt May 23 '23 at 21:09
  • I don't have the env (and time) to do a benchmark right now, but last time I did this was quite obvious, actually on most if not all documents I tried. I seem to remember using the PDF 1.7 spec and the Cinelerra GG manual as test references, for example. – mara004 May 24 '23 at 12:16
  • I just tested anyway. On my device, rendering CinGG manual takes ~50s with pypdfium2, compared to ~57 (+14%) with pdftoppm, at 300dpi (rsp. scale 4.2) with jpeg as output. It would be interesting to do a pure rendering benchmark that does not include image conversion and disk output, though. – mara004 May 24 '23 at 12:35
  • I tried with https://cinelerra-gg.org/download/CinelerraGG_Manual.pdf and on that file pyvips gets about 3x faster if you switch from poppler to pdfium, so I agree that's a very nice improvement. I edited my answer to include this info, thanks! – jcupitt May 24 '23 at 13:34
  • Cool, thanks! Concerning PDF benchmarking, perhaps the following two repos might also interest you: https://github.com/py-pdf/benchmarks and https://github.com/ArtifexSoftware/PyMuPDF-performance – mara004 May 24 '23 at 18:35
2

Try the following:

  1. pypdfium2
  2. Using the Python subprocess module to drive poppler's command-line tools, https://blog.alivate.com.au/poppler-windows/ (a rough sketch of both options follows)
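
A minimal sketch of both suggestions, for illustration only: the pypdfium2 calls follow the 4.x API, the file names are placeholders, and poppler's pdftoppm must be on the PATH for the subprocess route.

import subprocess

import pypdfium2 as pdfium

# option 1: render each page in-process with pypdfium2
pdf = pdfium.PdfDocument("input.pdf")
for i in range(len(pdf)):
    bitmap = pdf[i].render(scale=150 / 72)   # scale 1.0 is 72 dpi, so this is roughly 150 dpi
    bitmap.to_pil().save(f"page-{i}.jpg")

# option 2: shell out to poppler's pdftoppm
subprocess.run(["pdftoppm", "-jpeg", "-r", "150", "input.pdf", "page"], check=True)
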
Jitesh
1

With converters, speed is generally relative to the file size and complexity, since the content needs a fresh build on each run. For PDFs you are not generating yourself that can call for different solutions; however, the systems you are quoting require several steps, so the "fastest" route is the core machine-code binary, which is usually the CLI version, without any slower wrapping apps.

As a rough rule of thumb, 100 PNG pages per minute at 150 dpi is reasonable, so a run started 10 minutes ago has just finished 947 pages (i.e. 1.578 pages per second, or 0.6336 seconds per page).

In a recent stress test with a single complex page (on kit not too different from yours), resolution was the biggest factor: one complex chart page took from 1.6 to 14+ seconds depending on output resolution, and multithreading only reduced that to 12 seconds. https://stackoverflow.com/a/73060439/10802527

pdf2image is built around poppler, with pdfimages, pdftotext and pdftoppm, and rather than JPG I would recommend pdftoppm -png, since the results should be crisper, giving faster, leaner output that still looks good.
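
Since the question also needs the text, a minimal sketch of that recommendation driving the two poppler CLI tools from Python (file names are placeholders; poppler-utils must be installed):

import subprocess

# crisp numbered PNG pages at 150 dpi, written with the prefix "page"
subprocess.run(["pdftoppm", "-png", "-r", "150", "input.pdf", "page"], check=True)

# plain text extraction for the same file
subprocess.run(["pdftotext", "input.pdf", "input.txt"], check=True)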

ImageMagick cannot convert PDFs without Ghostscript, nor output text, so the fast core route there is Artifex Ghostscript. Also consider/compare with its sister application MuPDF (mutool); it has both image and text output, multi-threading and banding.

At the core of the Chrome/Edge/Chromium and Foxit/Skia solutions are the PDFium binaries, which can be found in various forms for different platforms.

Some rough times on my kit for a large file, all at 150 dpi:

poppler pdftoppm -f 1 -l 100 -png = 100 pages from the 13,234-page us-public-health-and-welfare-code.pdf,
or at similar speed:
pdftocairo -f 1 -l 100 -png -r 150 us-public-health-and-welfare-code.pdf time/out
The current time is: 17:17:17
The current time is: 17:18:08
100 pages as png = 51 seconds

100+ pages per minute (better than most high speed printers, but over 2 hours for just one file)

PDFium via a CLI exe was around 30 seconds for the 100 pages, but the resolution would need setting via EXIF in a second pass; however, let's be generous and say that's
Approx. 200 pages per minute (Est. 1 hour 6 mins total)

xpdf pdftopng with settings for 150 dpi x 100 pages from 13234pages.pdf
The current time is: 17:25:27
The current time is: 17:25:42
100 pages as png = 15 seconds

400 pages per minute (Est. 33 mins total)

mutool convert -o time/out%d.png -O resolution=150 x 100 pages from 13234pages.pdf
The current time is: 17:38:14
The current time is: 17:38:25
100 pages as png = 11 seconds

545 pages per minute (Est. 24.3 mins total)

That can be bettered:

mutool draw -st -P -T 4 -B 2048 -r 150 -F png -o ./time/out%d.png 13234pages.pdf 1-100
total 5076ms (0ms layout) / 100 pages for an average of 50ms

1,182 pages per minute (Est. 11.2 mins total)

Note a comment by @jcupitt

I tried time parallel mutool convert -A 8 -o page-%d.png -O resolution=150 us-public-health-and-welfare-code.pdf {}-{} ::: {1..100} and it's 100 pages in 600ms. If you use pgm, it's 300ms (!!).

That would be 10,000 or 20,000 pages per minute (Est. 0.66-1.32 mins total)

There are other good libs that render just as quickly in the same timeframe, but since they generally all demand the same single core/GPU/CPU/memory/fonts etc., multiple parallel processes on one device can often fail. One application that looked good for the task fell over with a memory failure after only 2 pages.
If you must use one device you can try separate invocations in "Parallel", however my attempts, in native Windows, always seemed thwarted by file locks on resources when there were conflicting demands for the bus or support files.
The only reliable way to multiprocess is to batch blocks of sequential sets of files across parallel devices, so scale up by farming out across multiple real "CPU/GPU"s and their dedicated drives.

Note this developer's comparison, where the three best of their bunch were:

  1. MuPDF
  2. Xpdf
  3. PDFium (their selection, which, as tested above, has the more permissive license)
K J
  • parallel worked for me, try `parallel pdftoppm us-public-health-and-welfare-code.pdf -png xxx -f {} -l {} ::: {1..100}` ... 2.4s for 100 pages. With pyvips and parallel I see 1.8s. – jcupitt Aug 27 '22 at 05:58
  • `pdftoppm us-public-health-and-welfare-code.pdf -png xxx -f 1 -l 100`, i.e. 100 pages at 150 dpi, is 44s elapsed time, so not far off your one-thread timing. This PC has 16 cores 32 threads, and I see an 18x speedup with parallel. You could try WSL2 -- it includes parallel, and has fast disc IO. – jcupitt Aug 27 '22 at 11:06
  • I'm a developer heh. I tried `time parallel mutool convert -A 8 -o page-%d.png -O resolution=150 us-public-health-and-welfare-code.pdf {}-{} ::: {1..100}` and it's 100 pages in 600ms. If you use pgm, it's 300ms (!!). – jcupitt Aug 27 '22 at 11:18