Why does my code not correctly split every page in a scanned PDF?

Question

Update: Thanks to stardt, whose script works! The PDF is one page of another, larger PDF. I tried the script on the larger file, and it also splits each PDF page correctly, but the order of the page numbers is sometimes right and sometimes wrong. For example, on pages 25-28 of the PDF file, the printed page numbers are 14, 15, 17, and 16. Why is that? The entire PDF can be downloaded from http://download304.mediafire.com/u6ewhjt77lzg/bgf8uzvxatckycn/3.pdf

Original: I have a scanned PDF in which two paper pages sit side by side on each PDF page. I would like to split each PDF page into two, with the original left half becoming the earlier of the two new PDF pages.

Here is my Python script named un2up inspired by Gilles:

#!/usr/bin/env python
import copy, sys
from pyPdf import PdfFileWriter, PdfFileReader
input = PdfFileReader(sys.stdin)
output = PdfFileWriter()
for p in [input.getPage(i) for i in range(0,input.getNumPages())]:
    q = copy.copy(p)
    (w, h) = p.mediaBox.upperRight

    p.mediaBox.upperLeft = (0, h/2)
    p.mediaBox.upperRight = (w, h/2)
    p.mediaBox.lowerRight = (w, 0)
    p.mediaBox.lowerLeft = (0, 0)

    q.mediaBox.upperLeft = (0, h)
    q.mediaBox.upperRight = (w, h)
    q.mediaBox.lowerRight = (w, h/2)
    q.mediaBox.lowerLeft = (0, h/2)

    output.addPage(q)
    output.addPage(p)
output.write(sys.stdout)

I ran the script in a terminal as un2up < page.pdf > out.pdf, but the output out.pdf is not split correctly.

I also checked w and h, the values returned by p.mediaBox.upperRight: they are 514 and 1224, which does not match the page's actual aspect ratio.

The file can be downloaded from http://download851.mediafire.com/bdr4sv7v5nzg/raci13ct5w4c86j/page.pdf.

stardt · Accepted Answer · 2011-08-13T01:23:30.593

7

Your code assumes that p.mediaBox.lowerLeft is (0, 0), but it is actually (0, 497).

This works for the file you provided:

#!/usr/bin/env python
import copy, sys
from pyPdf import PdfFileWriter, PdfFileReader
input = PdfFileReader(sys.stdin)
output = PdfFileWriter()
for i in range(input.getNumPages()):
    p = input.getPage(i)
    q = copy.copy(p)

    bl = p.mediaBox.lowerLeft
    ur = p.mediaBox.upperRight

    print >> sys.stderr, 'splitting page',i
    print >> sys.stderr, '\tlowerLeft:',p.mediaBox.lowerLeft
    print >> sys.stderr, '\tupperRight:',p.mediaBox.upperRight

    p.mediaBox.upperRight = (ur[0], (bl[1]+ur[1])/2)
    p.mediaBox.lowerLeft = bl

    q.mediaBox.upperRight = ur
    q.mediaBox.lowerLeft = (bl[0], (bl[1]+ur[1])/2)
    if i%2==0:
        output.addPage(q)
        output.addPage(p)
    else:
        output.addPage(p)
        output.addPage(q)

output.write(sys.stdout)
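
The midpoint arithmetic can be checked without any PDF library. The following is only a minimal sketch: `split_along_y` is a hypothetical helper (not part of pyPdf) that works on plain tuples, using the corner values reported for the file in question. It also shows why 514 and 1224 looked wrong as dimensions: the box starts at y = 497, so the height is 727 rather than 1224.

```python
def split_along_y(lower_left, upper_right):
    """Split a box into (bottom_half, top_half) along the y axis.

    Each half is returned as a (lower_left, upper_right) pair of tuples.
    Hypothetical helper for illustration; it does not touch any PDF object.
    """
    x0, y0 = lower_left
    x1, y1 = upper_right
    # The midpoint must be measured between both corners,
    # because the lower-left corner is not necessarily (0, 0).
    mid = (y0 + y1) / 2.0
    bottom = ((x0, y0), (x1, mid))
    top = ((x0, mid), (x1, y1))
    return bottom, top


# Corner values reported in this thread for page.pdf.
bottom, top = split_along_y((0, 497), (514, 1224))

# Real page dimensions are differences between corners, not the upper-right corner itself.
width, height = 514 - 0, 1224 - 497
```

With these inputs, taking h/2 = 612 as the split line (as the code in the question does) would cut well below the true middle of the box, which lies at y = 860.5.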

edited Aug 13 '11 at 01:23

answered Aug 13 '11 at 00:43

stardt


Thanks! It works! The PDF is one page of another, larger PDF. I tried the script on the larger file, and it also splits each PDF page correctly, but the order of the page numbers is sometimes right and sometimes wrong. For example, on pages 25-28 of the PDF file, the printed page numbers are 14, 15, 17, and 16. Why is that? The entire PDF can be downloaded from http://download304.mediafire.com/u6ewhjt77lzg/bgf8uzvxatckycn/3.pdf – Tim Aug 13 '11 at 00:57
@Tim I updated the code so that it reverses the order of the split for every other page. This splits your file correctly. – stardt Aug 13 '11 at 01:24
Thanks! (1) Do you know why we need to switch page `p` and `q` every two pages? Is this common to other pdf files, or just specific to this one? (2) I was also wondering how to understand the coordinate system on a pdf page, i.e. is p.mediaBox.lowerLeft the actual lowerleft or the upperright that we see when viewing the pdf file? Is the first coordinate along the horizontal or vertical direction that we see? – Tim Aug 13 '11 at 01:33
(2) The docs at http://pybrary.net/pyPdf/pythondoc-pyPdf.pdf.html say that mediaBox defines "the boundaries of the physical medium on which the page is intended to be displayed or printed." Print preview in adobe reader shows the pages in landscape mode, so it seems that the coordinates are for a portrait page. I suspect the answer to (1) is also something about page orientation, but I can't find enough documentation. – stardt Aug 13 '11 at 02:03
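
One possible explanation for the alternating order, offered only as an assumption that nobody in this thread confirmed: the PDF format lets each page carry a /Rotate entry, the number of degrees the page is turned clockwise when displayed. If the scanned pages alternate between 90 and 270, the half of the unrotated box that appears on the left would alternate too. The sketch below just encodes that geometry; `left_half_of_rotated_page` is a hypothetical helper, and whether this particular file actually uses alternating rotations was not checked.

```python
def left_half_of_rotated_page(rotate):
    """Say which half of the unrotated box ('top' or 'bottom') is displayed on the left.

    `rotate` is the clockwise display rotation in degrees.
    Hypothetical helper for illustration; assumes the page is split along its y axis.
    """
    rotate = rotate % 360
    if rotate == 90:
        # Turning clockwise moves the top edge to the right, so the bottom half ends up on the left.
        return "bottom"
    if rotate == 270:
        # Turning counterclockwise moves the top edge to the left.
        return "top"
    raise ValueError("a split along y only yields left/right halves for 90 or 270 degrees")


order_for_90 = left_half_of_rotated_page(90)
order_for_270 = left_half_of_rotated_page(270)
```

Under that assumption, the half named by this function would be the one to add to the output first.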

score 1 · Answer 2 · answered Apr 01 '13 at 10:37

@stardt's code was quite useful, but I had problems splitting a batch of PDF files with different orientations. Here's a more general function that is meant to work regardless of the page orientation:

import copy
import math
import pyPdf

def split_pages(src, dst):
    src_f = file(src, 'r+b')
    dst_f = file(dst, 'w+b')

    input = pyPdf.PdfFileReader(src_f)
    output = pyPdf.PdfFileWriter()

    for i in range(input.getNumPages()):
        p = input.getPage(i)
        q = copy.copy(p)
        q.mediaBox = copy.copy(p.mediaBox)

        x1, x2 = p.mediaBox.lowerLeft
        x3, x4 = p.mediaBox.upperRight

        x1, x2 = math.floor(x1), math.floor(x2)
        x3, x4 = math.floor(x3), math.floor(x4)
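        # Midpoints are measured from the lower-left corner, which is not always (0, 0).
        x5, x6 = math.floor((x1 + x3)/2), math.floor((x2 + x4)/2)

        if (x3 - x1) > (x4 - x2):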
            # horizontal
            p.mediaBox.upperRight = (x5, x4)
            p.mediaBox.lowerLeft = (x1, x2)

            q.mediaBox.upperRight = (x3, x4)
            q.mediaBox.lowerLeft = (x5, x2)
        else:
            # vertical
            p.mediaBox.upperRight = (x3, x4)
            p.mediaBox.lowerLeft = (x1, x6)

            q.mediaBox.upperRight = (x3, x6)
            q.mediaBox.lowerLeft = (x1, x2)

        output.addPage(p)
        output.addPage(q)

    output.write(dst_f)
    src_f.close()
    dst_f.close()
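
The orientation test can be tried on plain tuples before touching a real file. This is a sketch under stated assumptions, not the function above: `split_in_half` is a hypothetical helper that compares the box's width and height (differences between corners) and cuts across the longer side, returning the halves in the same order the function adds them (left then right, or top then bottom).

```python
def split_in_half(lower_left, upper_right):
    """Cut a box across its longer side and return the two halves.

    Each half is a (lower_left, upper_right) pair of tuples.
    Hypothetical helper for illustration; no PDF library is involved.
    """
    x0, y0 = lower_left
    x1, y1 = upper_right
    if (x1 - x0) > (y1 - y0):
        # Wider than tall: left half first, then right half.
        mid = (x0 + x1) / 2.0
        return ((x0, y0), (mid, y1)), ((mid, y0), (x1, y1))
    # Taller than wide: top half first, then bottom half.
    mid = (y0 + y1) / 2.0
    return ((x0, mid), (x1, y1)), ((x0, y0), (x1, mid))


wide = split_in_half((0, 0), (1000, 600))
tall = split_in_half((0, 497), (514, 1224))
```

Comparing width and height, rather than the raw upper-right coordinates, keeps the decision correct when the lower-left corner is offset from the origin.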

I can't seem to get this one working. Unlike the one by @stardt, this one produces an empty file. Any ideas @moraes? — Brian Z, Oct 01 '13 at 03:20

score 0 · Answer 3 · answered Aug 14 '13 at 10:18

I'd like to add that you have to make sure the mediaBox objects are not shared between the copies p and q. This can easily happen if you read p.mediaBox before taking the copy.

In that case, writing to e.g. p.mediaBox.upperRight may modify q.mediaBox and vice versa.

@moraes' solution takes care of this by explicitly copying the mediaBox.
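
The sharing effect is general Python behaviour for shallow copies and can be reproduced without pyPdf. The `Page` and `Box` classes below are hypothetical stand-ins, not the pyPdf classes; the sketch only shows that `copy.copy` duplicates the outer object while both copies keep pointing at the same inner box until the box itself is copied.

```python
import copy


class Box(object):
    """Stand-in for a media box: just two corner tuples."""
    def __init__(self, lower_left, upper_right):
        self.lower_left = lower_left
        self.upper_right = upper_right


class Page(object):
    """Stand-in for a page that holds a box."""
    def __init__(self, media_box):
        self.media_box = media_box


p = Page(Box((0, 497), (514, 1224)))
q = copy.copy(p)

# A shallow copy duplicates the page but not the box it refers to.
shared_before = q.media_box is p.media_box

# Copying the box explicitly gives q a box of its own.
q.media_box = copy.copy(p.media_box)
shared_after = q.media_box is p.media_box

# Changing p's box now leaves q's box untouched.
p.media_box.upper_right = (514, 860)
```

Without the explicit box copy, the last assignment would have changed the box seen through q as well.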
