How to get content of a PDF file page by page having base64 of the whole file content?

Question

I have the content of a PDF file in base64, like JVBERi0xLjIgDSXi48/T....

How can I parse it to get the base64 of each page?

Assume the PDF file has 5 pages. How can I get the content of each page in base64? I already googled it but could not find anything. Any help is appreciated.

http://stackoverflow.com/questions/4784825/how-to-read-pdf-files-using-java After that just base64 encode all of the read content. — Round Potato, Jan 11 '15 at 03:19
Thanks @RoundPotato. I prefer not to use any library for that. Do you know any other solution? — Sara, Jan 11 '15 at 03:21

Kurt Pfeifle · Accepted Answer · 2015-01-11T15:15:04.467

In general, it is not even possible to cut the contents of a native PDF file apart page by page, and that makes it impossible to do so while the file is still base64-encoded, as you will see.

The most general structure of a PDF file is, in this order:

1. PDF header
2. PDF objects (file body)
3. PDF xref table (table of contents, giving the file offset of each PDF object)
4. PDF trailer
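To make that layout concrete, here is a minimal Java sketch (not part of the original answer) that works on already-decoded bytes: it checks for the `%PDF-` header and reads the byte offset stored after the `startxref` keyword near the end of the file. The class name `PdfLayoutProbe`, its methods, and the hand-built skeleton are illustrative assumptions. The skeleton is not a viewable PDF, and real files that use cross-reference streams or incremental updates look different.

```java
import java.nio.charset.StandardCharsets;

class PdfLayoutProbe {
    // Hypothetical helper: true if the bytes start with the PDF header marker.
    static boolean hasPdfHeader(byte[] pdf) {
        int len = Math.min(pdf.length, 5);
        return new String(pdf, 0, len, StandardCharsets.ISO_8859_1).equals("%PDF-");
    }

    // Hypothetical helper: returns the offset written after the last "startxref", or -1 if absent.
    static long xrefOffset(byte[] pdf) {
        // ISO-8859-1 maps each byte to exactly one char, so string indexes equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int marker = text.lastIndexOf("startxref");
        if (marker < 0) {
            return -1;
        }
        int eof = text.indexOf("%%EOF", marker);
        int end = eof < 0 ? text.length() : eof;
        return Long.parseLong(text.substring(marker + "startxref".length(), end).trim());
    }

    // A hand-built skeleton with header, one object, xref table and trailer (illustration only).
    static byte[] sampleSkeleton() {
        String body = "%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n";
        String tail = "xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
                + "trailer\n<< /Size 2 /Root 1 0 R >>\n"
                + "startxref\n" + body.length() + "\n%%EOF\n";
        return (body + tail).getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] sample = sampleSkeleton();
        System.out.println("header found: " + hasPdfHeader(sample));
        System.out.println("xref table starts at byte " + xrefOffset(sample));
    }
}
```

Every offset in the xref table is an absolute position in the file, which is why these numbers stop being valid as soon as objects are removed or moved.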

You cannot assume that the PDF objects appear in the same order inside the file as the pages appear in a PDF viewer.

If you extract a single page, this page itself needs to be a valid PDF document containing (in this same order) header, objects, xref and trailer. The xref and trailer must be rebuilt so that they match the new document; they cannot simply be copied from the original document.

For this reason you need to decode the base64-encoded file completely before you can even think of accessing a single page of the resulting PDF.
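As a small illustration, the Java sketch below decodes the first 12 characters of the sample string from the question using the standard `java.util.Base64` decoder. Base64 simply maps every 3 input bytes to 4 output characters, so the encoded text carries no notion of pages; what comes out is just the start of the PDF header. The class name `Base64HeaderPeek` and the method `decodePrefix` are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64HeaderPeek {
    // Decodes a base64 fragment whose length is a multiple of 4 and shows it as text.
    static String decodePrefix(String base64Prefix) {
        byte[] raw = Base64.getDecoder().decode(base64Prefix);
        // ISO-8859-1 keeps a one-to-one mapping between bytes and chars.
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // First 12 characters of the sample in the question.
        System.out.println(decodePrefix("JVBERi0xLjIg")); // prints "%PDF-1.2 " (with a trailing space)
    }
}
```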

To get -- from a 5-page PDF document that has been encoded with base64 -- all individual PDF pages as base64, you have to follow these steps:

1. Decode the complete base64 file into a valid 5-page PDF document.
2. Split the 5-page PDF document into 5 separate 1-page PDF documents (you need to know the "rules of the PDF game" for this, or make use of a PDF library that does know).
3. Encode each 1-page PDF document with base64.
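The three steps above could be wired together roughly as follows. This is only a sketch under stated assumptions: the class `PageByPageBase64`, the method `splitBase64Pdf`, and the `pageSplitter` parameter are hypothetical names, and the actual page splitting is deliberately left to whatever you plug in (in practice a PDF library), because the JDK has nothing that understands PDF structure.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.function.Function;

class PageByPageBase64 {
    // pageSplitter is a hypothetical hook: it takes the bytes of a whole PDF and
    // returns one complete, valid single-page PDF (as bytes) per page.
    static List<String> splitBase64Pdf(String base64Pdf,
                                       Function<byte[], List<byte[]>> pageSplitter) {
        // Step 1: decode the complete base64 text into the original PDF bytes.
        byte[] wholePdf = Base64.getDecoder().decode(base64Pdf);

        // Step 2: split into single-page documents (delegated, not implemented here).
        List<byte[]> singlePagePdfs = pageSplitter.apply(wholePdf);

        // Step 3: encode each single-page document with base64 again.
        List<String> encodedPages = new ArrayList<>();
        for (byte[] page : singlePagePdfs) {
            encodedPages.add(Base64.getEncoder().encodeToString(page));
        }
        return encodedPages;
    }
}
```

A call would look like `splitBase64Pdf(base64OfWholeFile, mySplitter)`, where `mySplitter` wraps the PDF library of your choice. Keeping the splitter as a parameter makes it obvious that step 2 is the only part that needs real PDF knowledge.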

score 1 · Answer 2 · edited May 23 '17 at 12:27

You might want to clarify your question. It is not obvious from your wording whether you want to encode in base64 or decode from it.

Assuming you want to decode (since you said you have base64), there are standard libraries available; see "Decode Base64 data in Java".

Round Potato · answered Jan 11 '15 at 03:28 · edited May 23 '17 at 12:27

I have Base64 of the whole file. Is there a way to split it page-by-page without decoding it? – Sara Jan 11 '15 at 03:30
@Sara No, unless it has special delimiters marking each page's chunk of base64. – Round Potato Jan 11 '15 at 03:33
Yes, that is exactly what I am looking for. There should be some standard delimiter which separates each page. – Sara Jan 11 '15 at 03:34
@Sara Either you already know the delimiter, or you add one yourself that does not collide with the base64 character set. You can see the standard set here: http://en.wikipedia.org/wiki/Base64 – Round Potato Jan 11 '15 at 03:40
There are no special delimiters in between pages for PDF. Hence there cannot be such a thing for base64 encoded PDFs either. See my answer about the PDF file structure. – Kurt Pfeifle Jan 11 '15 at 10:49
