
I have 1000+ PDF files that need to be merged into one PDF:

from PyPDF2 import PdfReader, PdfWriter

writer = PdfWriter()

for i in range(1000):
    filepath = f"my/pdfs/{i}.pdf"
    reader = PdfReader(open(filepath, "rb"))
    for page in reader.pages:
        writer.add_page(page)

with open("document-output.pdf", "wb") as fh:
    writer.write(fh)

When I run the above code, at reader = PdfReader(open(filepath, "rb")) I get an error message:

IOError: [Errno 24] Too many open files

I think this is a bug. If not, what should I do?

– daydaysay

5 Answers


I recently came across this exact same problem, so I dug into PyPDF2 to see what's going on, and how to resolve it.

Note: I am assuming that filename is a well-formed file path string. Assume the same for all of my code.

The Short Answer

Use the PdfFileMerger() class instead of the PdfFileWriter() class. I've tried to make the following resemble your code as closely as I could:

from PyPDF2 import PdfFileMerger, PdfFileReader

[...]

merger = PdfFileMerger()
for filename in filenames:
    merger.append(PdfFileReader(file(filename, 'rb')))

merger.write("document-output.pdf")

The Long Answer

The way you're using PdfFileReader and PdfFileWriter keeps each file open, and eventually causes Python to raise IOError 24. To be more specific, when you add a page to the PdfFileWriter, you are adding references to the page in the open PdfFileReader (hence the noted IOError if you close the file). Python detects that the file is still referenced and doesn't do any garbage collection / automatic file closing despite re-using the file handle. The files remain open until PdfFileWriter no longer needs access to them, which is at writer.write(fh) in your code.

To solve this, create copies in memory of the content, and allow the file to be closed. I noticed in my adventures through the PyPDF2 code that the PdfFileMerger() class already has this functionality, so instead of re-inventing the wheel, I opted to use it instead. I learned, though, that my initial look at PdfFileMerger wasn't close enough, and that it only created copies in certain conditions.

My initial attempts looked like the following, and resulted in the same IO problems:

merger = PdfFileMerger()
for filename in filenames:
    merger.append(filename)

merger.write(output_file_path)

Looking at the PyPDF2 source code, we see that append() requires fileobj to be passed, and then uses the merge() function, passing in its last page as the new file's position. merge() does the following with fileobj (before opening it with PdfFileReader(fileobj)):

    if type(fileobj) in (str, unicode):
        fileobj = file(fileobj, 'rb')
        my_file = True
    elif type(fileobj) == file:
        fileobj.seek(0)
        filecontent = fileobj.read()
        fileobj = StringIO(filecontent)
        my_file = True
    elif type(fileobj) == PdfFileReader:
        orig_tell = fileobj.stream.tell()   
        fileobj.stream.seek(0)
        filecontent = StringIO(fileobj.stream.read())
        fileobj.stream.seek(orig_tell)
        fileobj = filecontent
        my_file = True

We can see that append() does accept a string, and when it does, it assumes it's a file path and creates a file object at that location. The end result is exactly what we're trying to avoid: a PdfFileReader() object holding a file open until the output is eventually written!

However, if we make either a file object of the file path string or a PdfFileReader (see Edit 2) object of the path string before it gets passed into append(), it will automatically create a copy for us as a StringIO object, allowing Python to close the file.

I would recommend the simpler merger.append(file(filename, 'rb')), as others have reported that a PdfFileReader object may stay open in memory, even after calling writer.close().

Hope this helped!

EDIT: I assumed you were using PyPDF2, not PyPDF. If you aren't, I highly recommend switching, as PyPDF is no longer maintained, with the author giving his official blessing to Phaseit to develop PyPDF2.

If for some reason you cannot swap to PyPDF2 (licensing, system restrictions, etc.), then PdfFileMerger won't be available to you. In that situation you can re-use the code from PyPDF2's merge function (provided above) to create a copy of the file as a StringIO object, and use that in your code in place of the file object.
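
A rough sketch of that idea, assuming the old pyPdf API (getNumPages()/getPage()) and that filenames is a list of well-formed paths: each file is read fully into a StringIO buffer and closed immediately, so only one real file handle is open at a time.

from StringIO import StringIO
from pyPdf import PdfFileReader, PdfFileWriter

writer = PdfFileWriter()
for filename in filenames:
    # Copy the whole file into memory, then close it right away.
    with open(filename, 'rb') as f:
        buf = StringIO(f.read())
    # The reader works off the in-memory copy, not an open file.
    reader = PdfFileReader(buf)
    for page_number in range(reader.getNumPages()):
        writer.addPage(reader.getPage(page_number))

with open("document-output.pdf", "wb") as out:
    writer.write(out)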

EDIT 2: Previous recommendation of using merger.append(PdfFileReader(file(filename, 'rb'))) changed based on comments (Thanks @Agostino).
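
Putting that together, a minimal Python 3 sketch of the recommended call (assuming a PyPDF2 release that still ships PdfFileMerger, and that filenames is a list of paths; on Python 3 the built-in file() is gone, so open() is used instead):

from PyPDF2 import PdfFileMerger

merger = PdfFileMerger()
for filename in filenames:
    # Passing an open file object lets PdfFileMerger copy the content
    # into memory, so the underlying handle can be released.
    merger.append(open(filename, 'rb'))

with open("document-output.pdf", "wb") as fh:
    merger.write(fh)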

– Rejected
    I'll be honest; I haven't read the long answer. Short answer was great though. – brad-tot Oct 18 '13 at 16:58
    I noticed I couldn't delete some of the files appended creating an intermediate `PdfFileReader` object with the call `writer.append(PdfFileReader(file(filename, 'rb')))`. They remain locked even after calling `writer.close()`. The simpler call `merger.append(file(filename, 'rb'))` does not seem to have the same problem. – Agostino Jun 01 '15 at 23:21
    Wouldn't this run into memory problem if the files are too big? – Nishant Apr 03 '16 at 13:33
  • @Nishant As with any objects you're creating in memory, yes. Realistically, if you're getting into the gigabytes for a single PDF file, there's likely a better solution. – Rejected Apr 04 '16 at 14:51
    @Rejected Alright thanks, its worth knowing that. A small utility function that choses named temporary file vs memory is a good solution I have seen on this. – Nishant Apr 04 '16 at 14:55
  • @Nishant I doubt that would solve the problem, seeing as how the whole issue comes from a hard limit Python places on the number of open files you can have at the same time. Furthermore, PyPDF will still need to open the new "merged" file even if you're doing this incrementally with files from the HDD. Unless you're using another tool to reduce the PDF file size (stripping duplicate fonts, reducing image quality, etc.), you'll still see about the same memory footprint. – Rejected Apr 04 '16 at 15:02
  • Oh yes good point, something to think. This does circumvent the too many files open error which actually I faced in production once. – Nishant Apr 04 '16 at 17:22
    This works but consumes a LOT of memory when working on a lot of files. I'm currently merging 2500 PDFs into one. Each PDF has 4 pages. The end result is a 10,000 page PDF. ...and my server just crashed. haha, too much memory. – teewuane Aug 01 '16 at 21:32
    @Rejected I believe for Python 3, instead of `file` in `merger.append(PdfFileReader(file(filename, 'rb')))`, you'll need to use `open`. Like `merger.append(PdfFileReader(open(filename, 'rb')))`. – Hiebs915 Mar 30 '22 at 22:51
    This was exactly my question and with 'open' it works. Nice to see that this comment is 17 hours old, although the original post was 7 years old. – Thomas Mar 31 '22 at 16:10
  • I wish the PyPDF documentation bothered to describe the `PdfMerger` constructor parameters. What is `class pypdf.PdfMerger(strict: bool = False, fileobj: Union[Path, str, IO] = '')` supposed to mean? `Union`? – Tom Russell Mar 24 '23 at 03:34

The pdfrw package reads each file in one go, so it will not suffer from the problem of too many open files. Here is an example concatenation script.

The relevant part -- assumes inputs is a list of input filenames, and outfn is an output file name:

from pdfrw import PdfReader, PdfWriter

writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)
writer.write(outfn)
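
For example, matching the questioner's naming scheme of 0.pdf through 999.pdf (the exact paths are just placeholders), the two names could be set up as:

inputs = [f"my/pdfs/{i}.pdf" for i in range(1000)]
outfn = "document-output.pdf"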

Disclaimer: I am the primary pdfrw author.

– Patrick Maupin

I have written this code to help with the answer:

import sys
import os
import PyPDF2

merger = PyPDF2.PdfFileMerger()

# get the target directory and the PDF file names from the command line
path = sys.argv[1]
pdfs = sys.argv[2:]
os.chdir(path)

# iterate over the documents
for pdf in pdfs:
    try:
        # if the document exists, merge it
        if os.path.exists(pdf):
            reader = PyPDF2.PdfFileReader(open(pdf, 'rb'))
            merger.append(reader)
            print(f"{pdf} merged!")
        else:
            print(f"problem with file {pdf}")
    except Exception:
        print(f"can't merge {pdf}, sorry")

merger.write("Merged_doc.pdf")

Here I have used PyPDF2.PdfFileMerger and PyPDF2.PdfFileReader instead of explicitly converting the file name to a file object.
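
For example (the names here are just placeholders), if the script above is saved as merge_pdfs.py, it can be run as python merge_pdfs.py /path/to/pdfs a.pdf b.pdf c.pdf: the first argument is the directory to change into, and the remaining arguments are the PDF files to merge.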

– Souravi Sinha

The problem is that you are only allowed to have a certain number of files open at any given time. There are ways to change this (http://docs.python.org/3/library/resource.html#resource.getrlimit), but I don't think you need this.
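
For reference, a minimal sketch of inspecting and raising that per-process limit with the resource module (Unix only; whether you may raise it, and how far, depends on your system):

import resource

# Current soft and hard limits on the number of open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)

# An unprivileged process may raise its soft limit up to the hard limit.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))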

What you could try is closing the files in the for loop:

output = PdfFileWriter()
for filename in filenames:
    f = open(filename, 'rb')
    input = PdfFileReader(f)
    # ... add the pages you need to output here, before closing f ...
    f.close()

– sgillis

It may be just what it says: you are opening too many files. You can explicitly use f = file(filename) ... f.close() in the loop, or use the with statement, so that each opened file is properly closed.

– flyingfoxlee