2

I'm trying to write a function which splits a PDF into separate pages. From this SO answer, I copied a simple function which does that:

def splitPdf(file_):
    pdf = PdfFileReader(file_)
    pages = []
    for i in range(pdf.getNumPages()):
        output = PdfFileWriter()
        output.addPage(pdf.getPage(i))
        with open("document-page%s.pdf" % i, "wb") as outputStream:
            output.write(outputStream)
    return pages

This, however, writes the new PDFs to disk instead of returning a list of the new PDFs as file variables. So I replaced the line output.write(outputStream) with:

pages.append(outputStream)

When I then try to write the elements of the pages list, however, I get a ValueError: I/O operation on closed file.
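
The error comes from the with block: it closes the file as soon as the block exits, so every handle stored in the list is already closed by the time it is used. A minimal sketch that reproduces this without any PDF library (the helper `collect_handles` and the temporary directory are made up for illustration):

```python
import os
import tempfile


def collect_handles(directory, count):
    """Hypothetical helper: mimics appending outputStream inside a with block."""
    handles = []
    for i in range(count):
        path = os.path.join(directory, "document-page%s.pdf" % i)
        with open(path, "wb") as output_stream:
            handles.append(output_stream)  # stores the handle, not the data
        # Leaving the with block has closed output_stream at this point
    return handles


with tempfile.TemporaryDirectory() as tmp:
    handles = collect_handles(tmp, 2)
    all_closed = all(h.closed for h in handles)
    try:
        handles[0].write(b"data")
        error = None
    except ValueError as exc:
        # Writing to a closed file raises ValueError
        error = str(exc)
```

Here `all_closed` ends up true and `error` holds the message of the ValueError, which is why the list has to contain either the data itself or a stream that stays open.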

Does anybody know how I can add the new files to the list and return them, instead of writing them to file? All tips are welcome!

kramer65
  • Have you tried reading the data, rather than storing the file handle - `pages.append(outputStream.read())`? – jonrsharpe Oct 23 '14 at 13:32
  • Have you tried using `cStringIO.StringIO` to open `outputStream`? – user4815162342 Oct 23 '14 at 13:37
  • what the user above said... you can usually substitute a `StringIO` object for a file and get the result out as a string that way – Anentropic Oct 23 '14 at 13:40
  • @jonrsharpe - I just tried it, and that gives me a `IOError: File not open for reading` on the line saying `pages.append(outputStream.read())`. Any other ideas? – kramer65 Oct 23 '14 at 13:40
  • @user4815162342 - Ehm, no I haven't tried StringIO. Any tips on how to do that? A code example would be very welcome.. :) – kramer65 Oct 23 '14 at 13:41
  • What is the use case? Do you want to have a list of file handles to operate on after you call splitPdf? Can't you just have a list of paths instead? – Rod Oct 23 '14 at 14:04

3 Answers

6

It is not completely clear what you mean by "list of PDFs as file variables". If you want to create strings with the PDF contents instead of files, and return a list of those strings, replace open() with StringIO and call getvalue() to obtain the contents:

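import cStringIO  # Python 2 only

# Assumption: the classes come from PyPDF2 (before 3.0) or the older pyPdf
from PyPDF2 import PdfFileReader, PdfFileWriter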

def splitPdf(file_):
    pdf = PdfFileReader(file_)
    pages = []
    for i in range(pdf.getNumPages()):
        output = PdfFileWriter()
        output.addPage(pdf.getPage(i))
        io = cStringIO.StringIO()
        output.write(io)
        pages.append(io.getvalue())
    return pages
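
Note that cStringIO exists only in Python 2; on Python 3 the in-memory buffer for binary data is io.BytesIO. The sketch below shows the same pattern with a stand-in for the PDF writer (the function `fake_write` and the bytes it produces are hypothetical, used only so the example runs without a PDF library). getvalue() returns the whole buffer regardless of the current stream position, so no seek(0) is needed here.

```python
import io


def fake_write(stream, page_number):
    # Hypothetical stand-in for output.write(stream); not real PDF data
    stream.write(b"%PDF-fake page " + str(page_number).encode("ascii"))


def split_to_bytes(page_count):
    pages = []
    for i in range(page_count):
        buffer = io.BytesIO()
        fake_write(buffer, i)
        # getvalue() copies out everything written so far as a bytes object
        pages.append(buffer.getvalue())
    return pages


pages = split_to_bytes(3)
```

Each element of `pages` is a plain bytes object, so it can be written to disk later with an ordinary `open(path, "wb")` and `write`.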
user4815162342
5

You can use the in-memory binary streams in the io module. This keeps the PDF files in memory instead of writing them to disk.

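import io

# Assumption: the classes come from PyPDF2 (before 3.0) or the older pyPdf
from PyPDF2 import PdfFileReader, PdfFileWriter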

def splitPdf(file_):
    pdf = PdfFileReader(file_)
    pages = []
    for i in range(pdf.getNumPages()):
        outputStream = io.BytesIO()

        output = PdfFileWriter()
        output.addPage(pdf.getPage(i))
        output.write(outputStream)

        # Move the stream position to the beginning,
        # making it easier for other code to read
        outputStream.seek(0)

        pages.append(outputStream)
    return pages

To later write the objects to a file, use shutil.copyfileobj:

import shutil

with open('page0.pdf', 'wb') as out:
    shutil.copyfileobj(pages[0], out)
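
shutil.copyfileobj copies from the source's current position, which is why the seek(0) above matters. A small sketch that writes every stream to its own file, using placeholder bytes instead of real PDF data (the helper name `save_pages` and the file name pattern are assumptions for illustration):

```python
import io
import os
import shutil
import tempfile


def save_pages(pages, directory):
    """Hypothetical helper: write each in-memory stream to its own file."""
    paths = []
    for i, stream in enumerate(pages):
        path = os.path.join(directory, "page%d.pdf" % i)
        stream.seek(0)  # rewind in case earlier code moved the position
        with open(path, "wb") as out:
            shutil.copyfileobj(stream, out)
        paths.append(path)
    return paths


# Placeholder streams standing in for the list returned by splitPdf
pages = [io.BytesIO(b"first"), io.BytesIO(b"second")]
pages[0].read()  # leaves the first stream positioned at its end

with tempfile.TemporaryDirectory() as tmp:
    paths = save_pages(pages, tmp)
    written = []
    for p in paths:
        with open(p, "rb") as f:
            written.append(f.read())
```

Without the rewind inside `save_pages`, the first file would come out empty, because its stream had already been read to the end.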
parchment
1

I haven't used PdfFileWriter, but I think this should work: return the writer objects themselves and write them to disk later.

def splitPdf(file_):
    pdf = PdfFileReader(file_)
    pages = []
    for i in range(pdf.getNumPages()):
        output = PdfFileWriter()
        output.addPage(pdf.getPage(i))
        pages.append(output)
    return pages

def writePdf(pages):
    # Number the output files starting from 1
    for i, p in enumerate(pages, start=1):
        with open("document-page%s.pdf" % i, "wb") as outputStream:
            p.write(outputStream)
Werner