0

I am trying to separate one large PDF into several smaller PDFs of varying lengths. At first I tried reading the original PDF with a FileInputStream and finding the signature hex strings to split it into smaller files with a FileOutputStream (as I have done with JPGs). However, I can't seem to find what hex string designates different pages in the original.

I've been looking through the iText API for the PdfWriter and PdfReader classes but I'm not exactly sure how to write data from the original to the smaller PDF, let alone how to create a PDF file in the first place.

Which of these approaches makes more sense? Or is there a simpler, better way?

Nathaniel Ford
user2484253
  • There is no such *page separation spot* in PDFs. PDF files consist of objects that can reference each other via a cross-reference table. Thus, the objects used for a single page may be spread over the whole file. Furthermore, some of these objects may be used on multiple pages, e.g. embedded fonts or repeating header/footer parts. – mkl Jun 14 '13 at 04:26
  • Perhaps this. [iText in Action: Extracting Page Content](http://itextpdf.com/examples/iia.php?id=277) – Sri Harsha Chilakapati Jun 14 '13 at 04:32
  • @SriHarsha That is code for text extraction from PDFs. – mkl Jun 14 '13 at 05:10

3 Answers

4

As mentioned in my comment on your question, there are no signature hex strings at which to split the source PDF. PDF files consist of objects that reference each other by object number; a cross-reference table maps those numbers to byte offsets in the file. Thus, the objects used for a single page may be spread over the whole file. Furthermore, some of these objects may be shared by multiple pages, e.g. embedded fonts or repeating header/footer parts.
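To make this concrete: a PDF reader typically starts at the end of the file, where the `startxref` keyword gives the byte offset of the cross-reference data, and follows object references from there rather than scanning for page markers. The following is only a minimal sketch using the Java standard library; the class and method names are hypothetical, and a real parser (such as iText's `PdfReader`) handles many more cases.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical illustration, not part of iText. */
final class XrefLocator {

    /** Returns the offset written after the last "startxref" keyword, or -1 if none is found. */
    static long findStartXref(byte[] pdf) {
        // ISO-8859-1 maps each byte to exactly one char, so string indices equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int pos = keyword + "startxref".length();
        // Skip the line break(s) between the keyword and the number.
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        int end = pos;
        while (end < text.length() && text.charAt(end) >= '0' && text.charAt(end) <= '9') {
            end++;
        }
        return end > pos ? Long.parseLong(text.substring(pos, end)) : -1;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java XrefLocator some.pdf
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("Cross-reference data starts at byte " + findStartXref(pdf));
    }
}
```

The offset only tells a parser where the object index lives; which objects make up a given page still has to be resolved by following references from the page tree.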

A library that understands the PDF format can, however, split a multi-page source PDF into a collection of partial documents.

With iText, have a look at the example Burst.java from iText in Action, 2nd Edition. The central code is this:

```java
// SOURCE is the input path; RESULT is an output path pattern used with
// String.format. Both are constants defined elsewhere in the example.
PdfReader reader = new PdfReader(SOURCE);
// We'll create as many new PDFs as there are pages
Document document;
PdfCopy copy;
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
for (int i = 0; i < n; ) {
    document = new Document();
    // ++i makes i the 1-based page number for this iteration
    copy = new PdfCopy(document, new FileOutputStream(String.format(RESULT, ++i)));
    document.open();
    copy.addPage(copy.getImportedPage(reader, i));
    document.close();
}
reader.close();
```

While this sample creates one result PDF per page, it adapts readily to page ranges: create one Document and PdfCopy per output file, and call copy.addPage(copy.getImportedPage(reader, p)) for every page p in the range before closing the document.
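Since the result files are meant to have varying lengths, it can help to work out the page ranges first and only then run the copy loop once per range. Below is a minimal sketch of that planning step in plain Java. `PageRanges` and `fromStartPages` are hypothetical names, not part of iText, and the sketch assumes you already know the 1-based page on which each new document begins.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of iText: plans which pages go into which output file. */
final class PageRanges {

    /**
     * Converts ascending 1-based start pages into inclusive {first, last} ranges.
     * Each range ends on the page before the next start; the last one ends at totalPages.
     */
    static List<int[]> fromStartPages(int totalPages, int... startPages) {
        List<int[]> ranges = new ArrayList<>();
        for (int k = 0; k < startPages.length; k++) {
            int first = startPages[k];
            int last = (k + 1 < startPages.length) ? startPages[k + 1] - 1 : totalPages;
            // Rejects out-of-bounds, duplicate, or descending start pages.
            if (first < 1 || first > last || last > totalPages) {
                throw new IllegalArgumentException("Invalid start page: " + first);
            }
            ranges.add(new int[] {first, last});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A 10-page source in which new documents begin on pages 1, 4 and 8.
        for (int[] r : fromStartPages(10, 1, 4, 8)) {
            System.out.println(r[0] + "-" + r[1]); // prints 1-3, 4-7, 8-10 on separate lines
        }
    }
}
```

Each returned range would then drive one Document/PdfCopy pair in the loop above, with the single addPage call replaced by an inner loop from `first` to `last`.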

mkl
1

If your goal is simply to split a PDF file's pages, you can do it here: click here. Alternatively, just use Acrobat (a very large application).

If you still want to use Java, I think this will be useful to you (for creating PDF files from text): click here. I have never used these libraries myself, but they seem fine.

I think this topic will help you find a PDF reader: here

I hope this helps, even a little.

Community
CME64
  • The website's program isn't that helpful to me, since I want the program to detect the separation spots automatically (it's to shave time off scanning hundreds of long files every day; currently, I just scan shorter PDFs individually). I think I can implement the auto-detection on my own. The second link helps a lot with handling PDFs, though. I'm not sure how I didn't run across that page after all my searching. Thanks! – user2484253 Jun 14 '13 at 02:03
  • I never really used the website, just Acrobat. You're welcome :) – CME64 Jun 14 '13 at 03:14
-1

If you are open to the idea of using a ready-made program, I have used this one to great effect:

PDFTK

It can split, combine, and rotate pages, and even has some built-in logic for specifying the order of the pages when re-combining (and can do it from multiple PDF files).

A.M.
  • PdfTk is iText compiled with the GNU Compiler for Java. You might as well use the real thing. Read http://manning.com/lowagie2/samplechapter6.pdf to find out how it's done. – Bruno Lowagie Jun 14 '13 at 06:39