My goal is to convert a multi-page PDF file into a number of .jpg files, in such a way that the images are written directly to the hard disk/SSD instead of being held in memory.

In Python 3.11:

from pdf2image import convert_from_path

poppler_path = r".\poppler-22.12.0\Library\bin"

# fmt='jpeg' is needed for .jpg output; pdf2image writes PPM files by default
images = convert_from_path('test.pdf', output_folder='.', output_file='test',
                           fmt='jpeg', poppler_path=poppler_path, paths_only=True)

pdf2image generates files with the following names: 'test_0001-1.jpg', 'test_0001-2.jpg', etc.

Problem: I would like the file names not to contain the '_0001-' part (e.g. 'test1.jpg').

The only way so far seems to be to call convert_from_path WITHOUT output_folder and then save each image with its save method. But this way the images are first held in memory, which can easily add up to many megabytes.

Is it possible to change the way pdf2image generates the file names when saving images directly to files?

vvvvv

3 Answers

I don't know whether Poppler already has parameters to customize the generated file names, but you can always do this:

  1. Run the command in an empty directory (e.g. one created with tempfile.TemporaryDirectory())
  2. After the command finishes, list the contents of the directory and store the result in a list
  3. Iterate over the list with a regex that matches the page numbers, and build a dict that maps each integer to its file name

At this point you are free to rename the files to whatever you like, or to process them.

The benefit of this solution is that it's neutral, robust and works for many similar scenarios.
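
The three steps above can be sketched roughly as follows. This is only an illustration, not part of pdf2image: the helper names `rename_pages` and `convert_and_rename`, the regex, and the target naming scheme are assumptions chosen to match the file names shown in the question.

```python
import re
import shutil
import tempfile
from pathlib import Path

# Assumed pattern: generated names end in "-<page>.jpg", e.g. "test_0001-12.jpg".
PAGE_RE = re.compile(r"-(\d+)\.jpg$")


def rename_pages(src_dir, dest_dir, stem):
    """Move generated pages from src_dir to dest_dir as '<stem><page>.jpg'."""
    # Steps 2 and 3: list the directory and map page number -> generated file.
    pages = {}
    for path in Path(src_dir).iterdir():
        match = PAGE_RE.search(path.name)
        if match:
            pages[int(match.group(1))] = path

    renamed = []
    for number in sorted(pages):
        target = Path(dest_dir) / f"{stem}{number}.jpg"
        # shutil.move also works when the two directories are on different drives.
        shutil.move(str(pages[number]), str(target))
        renamed.append(target)
    return renamed


def convert_and_rename(pdf_path, dest_dir, stem, poppler_path=None):
    """Step 1: let pdf2image write into an empty temporary directory, then rename."""
    from pdf2image import convert_from_path  # imported here so rename_pages works without it

    with tempfile.TemporaryDirectory() as tmp:
        convert_from_path(pdf_path, output_folder=tmp, output_file=stem,
                          fmt="jpeg", poppler_path=poppler_path, paths_only=True)
        return rename_pages(tmp, dest_dir, stem)
```

Because the page number is parsed with int(), zero-padded names such as '-01' and '-1' both end up as page 1, and files that do not match the pattern are left alone.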

jurez
  • Thank you, Jurez, for your reply. Yes, maybe renaming the files like you suggest is for me at this moment the most straightforward solution. – Gardener_NL Jan 07 '23 at 18:43

Hi, have a look at the file generators.py in the pdf2image code base, specifically at counter_generator(prefix="", suffix="", padding_goal=4).

In my copy, at line 41, you have:


....
@threadsafe
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Returns a joined prefix, iteration number, and suffix"""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(i).zfill(padding_goal) + str(suffix)

....

I think you need to play with the zfill() call in the yield line:

The Python String zfill() method is used to fill the string with zeroes on its left until it reaches a certain width; else called Padding. If the prefix of this string is a sign character (+ or -), the zeroes are added after the sign character rather than before.

The Python String zfill() method does not fill the string if the length of the string is greater than the total width of the string after padding.

Note: The zfill() method works similar to the rjust() method if we assign '0' to the fillchar parameter of the rjust() method.

https://www.tutorialspoint.com/python/string_zfill.htm
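
To see what the padding does, here is a small standalone sketch. `counter_names` is a hypothetical copy of the generator quoted above without the @threadsafe decorator, so it should not be shared between threads. It only illustrates zfill(); whether the padding can be changed without editing the library, and what the pdftoppm step appends afterwards, would still need to be checked against your pdf2image version.

```python
def counter_names(prefix="", suffix="", padding_goal=4):
    """Yield prefix + zero-padded counter + suffix, like the generator quoted above."""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(i).zfill(padding_goal) + str(suffix)


padded = counter_names("test_")
unpadded = counter_names("test", padding_goal=0)

print(next(padded), next(padded))      # test_0001 test_0002
print(next(unpadded), next(unpadded))  # test1 test2

# zfill() keeps a leading sign in front and never truncates longer strings.
print("-42".zfill(5))    # -0042
print("12345".zfill(3))  # 12345
```

With padding_goal=0 the counter is emitted without leading zeros, because zfill() never shortens a string.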

pippo1980
  • Thank you Pippo1980 for your suggestion. I also had an idea in your direction, but I was a bit scared to change the code: will the counter generator still be 'threadsafe' (and what does that mean)? In another part of the code the number of threads seems to influence the names, but maybe I am mistaken here because my understanding of Python is not great. – Gardener_NL Jan 07 '23 at 18:58
  • I don't know where the threadsafe decorator class is defined. In my mind it should prevent the counter from running in two places at the same time and giving the same name to different images. I am not sure the file name matters for that, but I could be wrong; I am very naive too. – pippo1980 Jan 07 '23 at 19:20
  • Everything is in the same file as counter_generator: https://github.com/Belval/pdf2image/blob/d415156659f76f898ff2d5c9e2884212ee7fac76/pdf2image/generators.py#L1 . I believe the decorator puts threading locks on the counter function so that different conversion threads do not advance the counter at the same time. – pippo1980 Jan 07 '23 at 21:13

Just use the Poppler utilities directly (or xpdf's pdftopng) by calling them from a shell. Add other options as desired, such as -r 200 for a resolution other than the default 150.

I recommend PNG for better image fidelity. If you want .jpg, replace "-png" below with "-jpeg" (the direct answer to the question as asked would be pdftoppm -jpeg -f 1 -l 9 -sep "" test.pdf "test"), but do follow the enhancement below for file sorting. Windows file sorting needs leading zeros; otherwise the order in a zip or folder is 1, 10, 11, ... 2, 20, ..., which is often undesirable.

"path to bin\pdftoppm" -png "path to \in.pdf" "name"

Result =

  • name-1.png
  • name-2.png etc.

Padding with extra digits is limited compared to other apps, so if you want "name-01.png" you need to output only pages 1-9, as in

\bin>pdftoppm -png -f 1 -l 9 -sep "0" in.pdf "name-"

Then, for pages 10 and up, say for a file of up to 99 pages, use the default separator (only the page numbers that actually exist are written):

\bin>pdftoppm -png -f 10 -l 99 in.pdf "name"

Thus, for 12 pages this would produce only -10, -11 and -12, as required.

Likewise, for up to 9999 pages you need 4 calls. If you don't want the "-", simply delete it. For a different output directory, adjust the output name accordingly.

set "name=%~dpn1"
set "bin=path to Poppler\Release-22.12.0-0\poppler-22.12.0\Library\bin"

"%bin%\pdftoppm" -png -r 200 -f 1 -l 9 -sep "0" "%name%.pdf" "%name%-00"
"%bin%\pdftoppm" -png -r 200 -f 10 -l 99 -sep "0" "%name%.pdf" "%name%-0"
"%bin%\pdftoppm" -png -r 200 -f 100 -l 999 -sep "0" "%name%.pdf" "%name%-"
"%bin%\pdftoppm" -png -r 200 -f 1000 -l 9999 -sep "" "%name%.pdf" "%name%-"

In the 12-page example above, the worst case is that the last calls reply with
Wrong page range given: the first page (100) can not be after the last page (12).
and the same for 1000. Those warnings can be ignored.

Those 4 lines could go in a Windows batch file (or the equivalent OS script, for SendTo or drag and drop) that accepts arguments. You can then call it from the system or from Python as pdf2png.bat input.pdf for each file; in that simple case the output lands in the same directory as the input.
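
If you would rather not keep a separate batch file, the same four calls could be issued from Python with subprocess. The following is only a sketch: the page ranges, separators and name suffixes are copied from the batch lines above, while `build_commands`, `run_all` and the optional `page_count` argument are hypothetical names introduced here.

```python
import subprocess

# (first page, last page, -sep value, text appended to the output name),
# copied from the four batch lines above.
RANGES = [
    (1, 9, "0", "-00"),
    (10, 99, "0", "-0"),
    (100, 999, "0", "-"),
    (1000, 9999, "", "-"),
]


def build_commands(pdftoppm, pdf_path, name, dpi=200, page_count=None):
    """Return one pdftoppm argument list per page range."""
    commands = []
    for first, last, sep, tail in RANGES:
        # If the page count is known, skip ranges that would only print
        # the "Wrong page range given" warning.
        if page_count is not None and first > page_count:
            break
        commands.append([
            pdftoppm, "-png", "-r", str(dpi),
            "-f", str(first), "-l", str(last),
            "-sep", sep, pdf_path, name + tail,
        ])
    return commands


def run_all(commands):
    """Run each command; ranges beyond the last page fail harmlessly."""
    for command in commands:
        subprocess.run(command, check=False)
```

Passing the arguments as a list avoids shell quoting problems with the empty -sep value. For example, run_all(build_commands(r"path to bin\pdftoppm", "in.pdf", "in", page_count=12)) would issue only the first two calls.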

K J
  • Thank you K J for your detailed response. Yes, the automatic naming is done outside Poppler, so it makes sense if I make my own protocol for it. I will start with your suggestion and make some changes if necessary. – Gardener_NL Jan 07 '23 at 19:07