How to decode PDF file and encode it back?

Question

My overall goal is to make some PDF files conform to the PDF/A standard for archival purposes. They fail one requirement, namely that some glyph mappings map to 0, which they should not.

My usual strategy was to use an old program called "Pdfedit" to decode the PDF file so that all byte streams became human-readable, edit the part of the PDF containing the glyph mappings, and then open the file in Adobe Acrobat, which re-encoded it automatically.

Now I have some PDFs that cause "Pdfedit" to crash upon opening. I tried using PDF-Parser but its output cannot be re-encoded by Adobe Acrobat.

When decoded with Pdfedit, the relevant parts used to look like this:

/CMapType 2 def
 1 begincodespacerange
 <00><04>
 endcodespacerange
 5 beginbfchar
 <00><0000>
 <01><0000>
 <02><263A>
 <03><0000>
 <04><0000>
 endbfchar
 endcmap

But now I use the following command:

python3 pdf-parser.py -f -n /path/to/file.pdf > dump.txt

Inside dump.txt, the relevant part looks like this:

b'/CMapType 2 def\n1 begincodespacerange\n<00><04>\nendcodespacerange\n5 beginbfchar\n<00><0000>\n<01><0000>\n<02><263A>\n<03><0000>\n<04><0000>\nendbfchar\nendcmap\nCMapName currentdict/CMap defineresource pop end end'

So it is a bytestring and any linebreak is rendered literally as \n. The txt file that contains this cannot be interpreted as a PDF by Adobe Acrobat.
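
For illustration: the b'...' text is Python's printed representation of a bytes object, so each \n is an escape sequence rather than damage to the data. Below is a minimal sketch of turning such a line back into real bytes, assuming the dump line is a valid Python bytes literal. The sample is shortened from the dump above, and `literal_to_bytes` is a hypothetical helper name, not part of pdf-parser.

```python
import ast


def literal_to_bytes(text: str) -> bytes:
    """Parse a printed Python bytes literal (b'...') back into bytes."""
    value = ast.literal_eval(text.strip())
    if not isinstance(value, bytes):
        raise ValueError("input is not a bytes literal")
    return value


# Shortened sample of the dump line; the raw string keeps the backslashes literal.
dump_line = r"b'/CMapType 2 def\n1 begincodespacerange\n<00><04>\nendcodespacerange'"

stream = literal_to_bytes(dump_line)
# Printing the decoded bytes shows real line breaks instead of \n.
print(stream.decode("latin-1"))
```

This only recovers the content of one stream. It does not rebuild a PDF that Acrobat can open.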

I have now also realized that many elements such as %%EOF are delimited by ''.

The real issue is how to get Acrobat-readable output from pdf-parser.py: redirecting with the shell's > operator does not work, and the output printed to stdout in the shell has the same problem.

I will try out a few things but could really use some help on this!

By *"But now, it looks like this"* do you mean *now using the other tool* or *now in these new pdfs*? — mkl, Sep 02 '20 at 04:57
The new tool outputs a txt file that contains this part. Adobe Acrobat cannot read and re-encode that txt file. I say this because with the old decoding software, after manually changing things, I could open the resulting txt file with Adobe Acrobat to create a working PDF file. Edited my post now. — Smogshaik, Sep 03 '20 at 13:06
So it looks like that functionality of the new tool is not meant for re-encoding PDFs for further use but merely for debugging purposes. — mkl, Sep 03 '20 at 16:25

score 1 · Answer 1 · answered Sep 10 '20 at 15:42

Answering my own question in case this is relevant for someone down the line.

Didier Stevens, the dev behind the pdf-parser, answered that his tool is not made for this. He recommended qpdf instead.

That was indeed the solution. Make sure you use the flag --stream-data=uncompress so that compressed parts are also accessible in the output. The command to use with qpdf is:

qpdf old_file.pdf --stream-data=uncompress --decode-level=all new_file.txt
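
If several files need the same treatment, the command above can be driven from a script. This is only a sketch: it assumes the qpdf binary is on the PATH, and `build_qpdf_command` and `uncompress_pdf` are hypothetical helper names. The flags are the ones from the command above.

```python
import subprocess


def build_qpdf_command(src: str, dst: str) -> list:
    """Return the argument list for the qpdf command shown above."""
    return ["qpdf", src, "--stream-data=uncompress", "--decode-level=all", dst]


def uncompress_pdf(src: str, dst: str) -> None:
    """Run qpdf on one file; requires the qpdf binary to be installed."""
    result = subprocess.run(build_qpdf_command(src, dst))
    # qpdf uses exit code 3 for "finished, but with warnings",
    # so only other non-zero codes are treated as failures here.
    if result.returncode not in (0, 3):
        raise RuntimeError(f"qpdf failed with exit code {result.returncode}")


# Show the command without running it.
print(build_qpdf_command("old_file.pdf", "new_file.txt"))
```

Calling `uncompress_pdf("old_file.pdf", "new_file.txt")` would then write the uncompressed copy.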

You can also write the output with a .pdf extension instead of .txt. Either way, you will be able to open it in a text editor. Once you have made your changes, change the file extension to .pdf (if it is not already) and process the file further with Acrobat or any other conversion program.
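
To find the entries that need changing, a small script can list the codes whose mapping is zero. This is a rough sketch, assuming the uncompressed file contains bfchar blocks like the one in the question. `zero_mappings` is a hypothetical helper name, and the sketch ignores bfrange blocks.

```python
import re

# One bfchar entry such as "<02><263A>": source code, then destination value.
BFCHAR_ENTRY = re.compile(rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")
# Everything between "beginbfchar" and "endbfchar".
BFCHAR_BLOCK = re.compile(rb"beginbfchar(.*?)endbfchar", re.DOTALL)


def zero_mappings(data: bytes) -> list:
    """Return the source codes whose bfchar destination is all zeros."""
    codes = []
    for block in BFCHAR_BLOCK.finditer(data):
        for src, dst in BFCHAR_ENTRY.findall(block.group(1)):
            if int(dst, 16) == 0:
                codes.append(src.decode("ascii"))
    return codes


# Sample taken from the CMap in the question.
sample = b"""5 beginbfchar
<00><0000>
<01><0000>
<02><263A>
<03><0000>
<04><0000>
endbfchar
endcmap"""

print(zero_mappings(sample))  # ['00', '01', '03', '04']
```

For a real file, read it in binary mode, for example `zero_mappings(open("new_file.txt", "rb").read())`. The script only reports the entries; deciding what each code should map to is still a manual step.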
