
I am trying to split a single 20-page PDF file into five separate PDF files: the 1st contains pages 1-3, the 2nd contains only page 4, the 3rd contains pages 5-10, the 4th contains pages 11-17, and the 5th contains pages 18-20. I need working code in Python. The code below splits the entire PDF into single pages, but I want the pages grouped.

    from PyPDF2 import PdfFileWriter, PdfFileReader
    inputpdf = PdfFileReader(open("input.pdf", "rb"))
    for i in range(inputpdf.numPages):
        j = i+1
        output = PdfFileWriter()
        output.addPage(inputpdf.getPage(i))
        with open("page%s.pdf" % j, "wb") as outputStream:
            output.write(outputStream)

3 Answers


To me this looks like a task for pdfrw. Based on this example from GitHub, I wrote the following example code:

from pdfrw import PdfReader, PdfWriter
pages = PdfReader('inputfile.pdf').pages
parts = [(3,6),(7,10)]
for part in parts:
    outdata = PdfWriter(f'pages_{part[0]}_{part[1]}.pdf')
    for pagenum in range(*part):
        outdata.addpage(pages[pagenum-1])
    outdata.write()

This creates two files, pages_3_6.pdf and pages_7_10.pdf, each with 3 pages: pages 3, 4, 5 and pages 7, 8, 9 respectively, because range excludes its end value. Note the pagenum-1 in the code: the -1 is needed because PDF page numbering starts at 1, while list indices start at 0. I also used so-called f-strings to build the names of the output files. In my opinion they are a slick method, but they are not available in Python 2 and require Python 3.6 or later (I tested my code in 3.6.7), so you can use the old formatting method instead if you wish. Remember to alter the file names and ranges according to your needs.
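Since the ranges above exclude their end value, it may be easier to write the groups the way the question states them (1-based, both ends included) and convert them. This is only a minimal sketch: the helper name to_half_open is made up for illustration and is not part of pdfrw.

```python
def to_half_open(groups):
    """Convert 1-based inclusive (first, last) page groups
    into (start, stop) pairs that can be passed to range(*part)."""
    parts = []
    for first, last in groups:
        if first < 1 or last < first:
            raise ValueError(f"invalid page group: ({first}, {last})")
        # range() excludes its stop value, so add 1 to include the last page
        parts.append((first, last + 1))
    return parts


# The five groups from the question: 1-3, 4, 5-10, 11-17, 18-20
groups = [(1, 3), (4, 4), (5, 10), (11, 17), (18, 20)]
parts = to_half_open(groups)
print(parts)
# [(1, 4), (4, 5), (5, 11), (11, 18), (18, 21)]
```

The resulting parts list can be used in the loop above as-is. Note that the output file names would then carry the exclusive end value (for example pages_1_4.pdf for pages 1-3).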

Daweo
  • `parts = [(1,3),(4),(5,10),(11,17),(18,20)] for part in parts: outdata = PdfWriter(f'pages_{part[0]}_{part[1]}.pdf') for pagenum in range(*part): outdata.addpage(pages[pagenum-1]) outdata.write()` The split code is not working for the above case. Kindly help. – Sutirtha Thakur Apr 11 '19 at 06:01
  • @SutirthaThakur: `parts` has to be a `list` of 2-`tuple`s, so `(4)` is not legal (it is just the integer `4`). You should use `(4,5)` instead. Also keep in mind that `(1,3)` means pages 1 and 2, and `(4,5)` means page 4. – Daweo Apr 11 '19 at 07:06
  • `parts = [(1,4),(4,5),(5,10),(10,20)]` When I enter this I get `IndexError: list index out of range`. – Sutirtha Thakur Apr 11 '19 at 09:20
  • @SutirthaThakur: please check whether your .pdf file actually has that many pages. I do not see any other possible reason for the `IndexError`. – Daweo Apr 11 '19 at 10:39
  • It contains 20 pages only. – Sutirtha Thakur Apr 11 '19 at 10:41
  • Please add the line `print(len(pages))` below `pages = PdfReader...`. This will show how many pages were actually read. – Daweo Apr 11 '19 at 11:05
  • It says 12, but it should be 20. – Sutirtha Thakur Apr 11 '19 at 12:02
  • Then this means `PdfReader` for some reason did not load the whole .pdf. Solving that is beyond my capability. – Daweo Apr 11 '19 at 12:30
  • `input_file = PyPDF2.PdfFileReader('input.pdf')` works fine. – Sutirtha Thakur Apr 12 '19 at 08:58
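
Both problems from the comments (a bare `(4)` and a range running past the pages that were actually loaded) can be caught before any file is written. The following is only a sketch in plain Python: find_bad_parts is a hypothetical helper name, and page_count stands for whatever `len(pages)` reports.

```python
def find_bad_parts(parts, page_count):
    """Return (part, reason) pairs for entries that would break the split loop."""
    problems = []
    for part in parts:
        # (4) is just the int 4; a one-element tuple would be written (4,)
        if not (isinstance(part, tuple) and len(part) == 2):
            problems.append((part, "not a 2-tuple"))
            continue
        start, stop = part
        if start < 1 or stop <= start:
            problems.append((part, "empty or invalid range"))
        elif stop - 1 > page_count:
            # the loop reads pages[stop - 2] last, so stop - 1 is the last page number used
            problems.append((part, "extends past the last page"))
    return problems


print(type((4)).__name__, type((4,)).__name__)
# int tuple
print(find_bad_parts([(1, 4), (4, 5), (5, 10), (10, 20)], page_count=12))
# [((10, 20), 'extends past the last page')]
print(find_bad_parts([(1, 3), (4), (5, 10)], page_count=20))
# [(4, 'not a 2-tuple')]
```

With the 12 pages reported in the comments, only the last range is flagged, which matches the `IndexError` appearing there.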

If you have Python 3, you can use tika, as described in the following answer:

How to extract text from a PDF file?

Sa'ad

How to extract specific pages (or split specific pages) from a PDF file and save those pages as a separate PDF using Python.

pip install PyPDF2 # to install module/package

from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_file_path = 'Unknown.pdf'
file_base_name = pdf_file_path.replace('.pdf', '')

pdf = PdfFileReader(pdf_file_path)

pages = [0, 2, 4] # page 1, 3, 5
pdfWriter = PdfFileWriter()

for page_num in pages:
    pdfWriter.addPage(pdf.getPage(page_num))

# the with block closes the file automatically, so no explicit close() is needed
with open('{0}_subset.pdf'.format(file_base_name), 'wb') as f:
    pdfWriter.write(f)

CREDIT : How to extract PDF pages and save as a separate PDF file using Python
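
The same calls can be put in a loop to write one file per page group, which is what the question asks for. This is a rough sketch, not a tested solution: it assumes the legacy PyPDF2 API used above (PdfFileReader, PdfFileWriter, getPage, addPage, numPages), which exists in PyPDF2 versions before 3.0. The function names split_by_groups and output_name and the output naming scheme are made up for illustration.

```python
def output_name(prefix, first, last):
    """Build an output file name such as pages_1_3.pdf or pages_4.pdf."""
    if first == last:
        return f"{prefix}_{first}.pdf"
    return f"{prefix}_{first}_{last}.pdf"


def split_by_groups(input_path, groups, prefix="pages"):
    """Write one PDF per (first, last) group; page numbers are 1-based and inclusive."""
    # Imported here so the helper above works without PyPDF2 installed.
    # Assumption: legacy API, PyPDF2 < 3.0.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    written = []
    # Keep the input file open until every output has been written
    with open(input_path, "rb") as infile:
        reader = PdfFileReader(infile)
        total = reader.numPages
        for first, last in groups:
            if not 1 <= first <= last <= total:
                raise ValueError(f"group ({first}, {last}) does not fit {total} pages")
            writer = PdfFileWriter()
            for page_number in range(first, last + 1):
                # getPage() is 0-based, page numbers are 1-based
                writer.addPage(reader.getPage(page_number - 1))
            name = output_name(prefix, first, last)
            with open(name, "wb") as outfile:
                writer.write(outfile)
            written.append(name)
    return written


# Example call for the grouping in the question (requires input.pdf with 20 pages):
# split_by_groups("input.pdf", [(1, 3), (4, 4), (5, 10), (11, 17), (18, 20)])
```

Checking each group against numPages first means a short or partially read file raises a clear error instead of an IndexError halfway through.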

thrinadhn