Python - split PDF by content into multiple files

Question

Let's assume that I have a PDF file with 300 pages. It actually contains 100 forms (always 3 pages per form). On the first page of each form, there is a text value that determines which output file the form goes to. This value consists of the letter "G" followed by 3 digits (e.g. "G100", "G201", etc.). Here is where my problem starts: the forms are mixed up in the PDF. To show what I mean:

1st page: G100
4th page: G201
7th page: G100
10th page: G256
...
298th page: G100

Based on that, I should create an output file "G100.pdf" which will contain pages 1-3, 7-9, 298-300, and do the same for each unique type of form. I don't know how many types there will be, how they will be named (aside from the described pattern), or how many page ranges each will have.

Is there any way to accomplish that using Python? I've seen some ways to use PyPDF2 to split pages, but I don't know how to do this efficiently for big PDFs with non-contiguous data.

It doesn't. But thanks for taking the time to answer. The way I started coding this was to run a regex on each page for "G([0-9]{3})", and my problem starts there, because I should then put the form's 3 pages into a new PDF file. But after that I might find the same "G" type some pages later, and then I should "append" to the initial file — Makar, Feb 10 '20 at 13:20
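The regex-per-page idea above can be sketched without appending to files at all. One option is to first collect the page indices for each code in a dictionary, then write each output file once. The following is a minimal sketch, not a definitive implementation. It assumes the `pypdf` package (the successor of PyPDF2), a text-based PDF whose first-of-form pages yield the code through text extraction (scanned images would need OCR), and a total page count that is a multiple of 3. The names `group_pages_by_code` and `split_pdf_by_code` are hypothetical.

```python
import re
from collections import defaultdict

# Form code as described in the question: "G" followed by exactly 3 digits.
FORM_CODE = re.compile(r"\bG\d{3}\b")


def group_pages_by_code(first_page_texts, pages_per_form=3):
    """Map each form code to the 0-based page indices of all its forms.

    first_page_texts holds the extracted text of the first page of every
    form, in document order (hypothetical helper, not a library function).
    """
    groups = defaultdict(list)
    for form_no, text in enumerate(first_page_texts):
        match = FORM_CODE.search(text or "")
        if match is None:
            raise ValueError(f"no form code found on form number {form_no + 1}")
        start = form_no * pages_per_form
        groups[match.group()].extend(range(start, start + pages_per_form))
    return dict(groups)


def split_pdf_by_code(src_path, out_dir=".", pages_per_form=3):
    """Write one '<code>.pdf' per form code (assumes pypdf is installed)."""
    from pathlib import Path
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(src_path)
    # Only the first page of each form needs text extraction.
    first_pages = range(0, len(reader.pages), pages_per_form)
    texts = [reader.pages[i].extract_text() for i in first_pages]

    for code, indices in group_pages_by_code(texts, pages_per_form).items():
        writer = PdfWriter()
        for i in indices:
            writer.add_page(reader.pages[i])
        writer.write(str(Path(out_dir) / f"{code}.pdf"))
```

Calling `split_pdf_by_code("forms.pdf")` (with `forms.pdf` as a placeholder file name) would then produce files such as `G100.pdf` in the current directory. Because the grouping step only stores integers, the number of form types and page ranges does not need to be known in advance.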
I used to do heavy processing of PostScript and PDF files, and found Ghostscript really powerful for that. It comes with a command-line interface that can be used from Python through the `subprocess` module (the portable way), and it can also be used as a dynamically loaded library if you want to save the load time of a process when handling many files. But that supposes you already know which pages you want in the resulting file... — Serge Ballesta, Feb 10 '20 at 13:46
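As a rough illustration of the Ghostscript route through `subprocess`, the sketch below builds a command line for one output file once the page numbers are known. It makes several assumptions: the executable is called `gs` (on Windows it is typically `gswin64c`), and the installed Ghostscript version supports the `-sPageList` option for selecting non-contiguous pages, which should be checked against the documentation for that version. The helper names `to_page_list`, `ghostscript_command` and `extract_pages` are hypothetical.

```python
import subprocess


def to_page_list(indices):
    """Collapse 0-based page indices into a 1-based list like '1-3,7-9'."""
    ranges = []
    for i in sorted(indices):
        page = i + 1
        if ranges and page == ranges[-1][1] + 1:
            ranges[-1][1] = page  # extend the current consecutive run
        else:
            ranges.append([page, page])  # start a new run
    return ",".join(f"{a}-{b}" if a != b else str(a) for a, b in ranges)


def ghostscript_command(src_path, out_path, indices, executable="gs"):
    """Build (but do not run) a Ghostscript call that copies selected pages."""
    return [
        executable,
        "-q",
        "-dNOPAUSE",
        "-dBATCH",
        "-sDEVICE=pdfwrite",
        f"-sPageList={to_page_list(indices)}",  # assumed to be supported
        f"-sOutputFile={out_path}",
        src_path,
    ]


def extract_pages(src_path, out_path, indices):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(ghostscript_command(src_path, out_path, indices), check=True)
```

The page indices still have to come from somewhere else, for example from a text search over each form's first page, which is the caveat raised in the comment above.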


0 Answers