I've been looking for a way to obtain the source code of a PDF file: not the hex dump, but a plain-text representation. My intention is to write a PDF file as plain text so that I can create a PDF report with an ESP32 or perhaps an Arduino board: upload the source code to a program, save the output to an SD card, and give it a .pdf extension.

I know it's more complicated than just adding lines and strings as you would in an HTML document. If I add or delete an object, the file will be corrupted, but the plan is to generate a PDF layout just like this one:

PDF Layout Example
PDF Layout Table Example

That way I wouldn't be deleting or adding any objects, just modifying strings that already exist. I found that I can create PDF files in a text editor like Notepad using plain text, as in this example:

%PDF-1.4
1 0 obj
  << /Type /Catalog
      /Outlines 2 0 R
      /Pages 3 0 R
  >>
endobj

2 0 obj
  << /Type /Outlines
      /Count 0
  >>
endobj

3 0 obj
  << /Type /Pages
      /Kids [ 4 0 R ]
      /Count 1
  >>
endobj

4 0 obj
  << /Type /Page
      /Parent 3 0 R
      /MediaBox [ 0 0 612 792 ]
      /Contents 5 0 R
      /Resources << /ProcSet 6 0 R
      /Font << /F1 7 0 R >>
  >>
>>
endobj

5 0 obj
  << /Length 73 >>
stream
  BT
    /F1 24 Tf
    100 100 Td
    ( Hello World ) Tj
  ET
endstream
endobj

6 0 obj
  [ /PDF /Text ]
endobj

7 0 obj
  << /Type /Font
    /Subtype /Type1
    /Name /F1
    /BaseFont /Helvetica
    /Encoding /MacRomanEncoding
  >>
endobj

xref
0 8
0000000000 65535 f
0000000009 00000 n
0000000074 00000 n
0000000120 00000 n
0000000179 00000 n
0000000364 00000 n
0000000466 00000 n
0000000496 00000 n

trailer
  << /Size 8
    /Root 1 0 R
  >>
startxref
625
%%EOF

So I've been searching for a way to extract that kind of code from my PDF layout, but I've only been able to extract the hex code, which is of little use for my purpose. I would be grateful for any help or guidance on this project.
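
The fixed-layout idea described above (changing only text that already exists) can be sketched as follows. This is a minimal illustration in Python rather than Arduino code, and it assumes the template is uncompressed and contains a fixed-width placeholder; the placeholder `@@TEMP@@@@` and the helper `patch_field` are hypothetical names, not part of any PDF tool. Because the replacement is padded to the placeholder's length, no byte offset in the file moves.

```python
def patch_field(pdf_bytes: bytes, placeholder: bytes, value: str) -> bytes:
    """Overwrite a fixed-width placeholder without changing the file length."""
    encoded = value.encode("ascii")
    if len(encoded) > len(placeholder):
        raise ValueError("value is longer than the placeholder")
    if pdf_bytes.count(placeholder) != 1:
        raise ValueError("placeholder must occur exactly once")
    # Pad with spaces so every later byte stays at the same offset.
    padded = encoded.ljust(len(placeholder), b" ")
    return pdf_bytes.replace(placeholder, padded)


# Hypothetical fragment of an uncompressed content stream.
template = b"BT /F1 12 Tf 100 700 Td (Temp: @@TEMP@@@@) Tj ET"
patched = patch_field(template, b"@@TEMP@@@@", "13.55 C")
print(patched.decode("ascii"))
```

Values containing `(`, `)` or `\` would also need a backslash escape inside a PDF literal string, which this sketch does not handle.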

Jason Aller
  • Check existing templating methods: https://stackoverflow.com/questions/7145778/how-to-replace-text-in-a-pdf-with-c – Abel Aug 21 '21 at 16:50
  • Open your PDF in a text editor and you have your text representation of the document (plus those sections that cannot be represented in text). – IInspectable Aug 21 '21 at 16:57
  • Thanks for your replies. I've already tried to do that, but I only get characters like these: xœ•X]Sã6}ϯÐ#ûP¡K¶ó–(Ûiv[0ÙÎÎì˜Ä€wb›Úvwÿ@¯,Ù–B,›°,–9çèêèêã I'm sure there's a way to get readable, modifiable code from a PDF, but I don't know much about the PDF language... – Diego Estrada Aug 21 '21 at 17:10
  • A PDF file is binary. Byte offsets are used with a map to lay out the page and many other things. This requires a composition engine to lay down text, kern characters, determine proper line breaking and pagination. You could never code up more than a simple hello world without understanding all this. A better path would be Java on ESP and then rewriting FOP to run on this stripped-down platform. – Kevin Brown Aug 21 '21 at 17:23
  • PDF is a binary format. Yes, you can restrict yourself and not use compression or embed other binary data so it *looks* like pure text in a text editor. But that does not correspond to some "PDF source code"; it is merely a restriction to a small subset of what PDF allows. – mkl Aug 21 '21 at 17:24
  • I hadn't considered running Java on ESP, that is pretty enlightening, thanks! I thought I could simply extract a "source code" from a PDF file. Thanks for all of this info! :)) – Diego Estrada Aug 21 '21 at 17:37

2 Answers

0

I found a solution with the software PDFEdit (http://pdfedit.cz/en/pdfedit_windows.html). There is an option called Decode in the Debug section; it generates a .decode file, which I then opened with Notepad. I was able to get readable, modifiable code (modifiable in the parts that I needed, such as dates, hours, names, temperatures, routes, etc.). You can try it: modify some text, save the file as a .pdf, and you will be able to see the changes.

Evidence:

Original document, unmodified
"Source code" viewed in Notepad
Document modified with Notepad: after changing some text, I saved it as .pdf and saw the changes I wanted.

The code is really extensive (about 5,000 lines), but maybe I can generate a really simple template and reduce the line count. Thanks to everyone!
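
For readers without PDFEdit, a rough idea of what decoding a stream involves can be sketched in Python. The unreadable characters quoted in the comments begin with `xœ`, which is how many editors display the bytes 0x78 0x9C that typically open a zlib (FlateDecode) stream. The sketch below is an assumption-laden illustration, not PDFEdit's method: it scans for `stream`/`endstream` with a regular expression instead of parsing the file properly, and `inflate_streams` is a hypothetical helper name.

```python
import re
import zlib

# Matches the bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes) -> list:
    """Return every stream body, inflated when it is zlib data."""
    bodies = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj ignores the line break left after the data.
            bodies.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            bodies.append(raw)  # not Flate data, keep the bytes as found
    return bodies


# Self-made sample: one compressed content stream.
content = b"BT /F1 24 Tf 100 100 Td (Hello World) Tj ET"
packed = zlib.compress(content)
sample = (b"5 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(packed)
          + packed + b"\nendstream\nendobj\n")
print(inflate_streams(sample)[0].decode("ascii"))
```

Streams that use other filters (images, fonts) are returned unchanged by this sketch.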

0

For what you propose, one potential solution is MuPDF/MuTool. If you wish to decompile an existing PDF, there are options in MuPDF-GL for Windows, using option A to convert to ASCII and "PrettyPrint".

You can write your own PDF as text, but it can have limitations. This is accepted as a working PDF:

%PDF-1.2
4 0 obj << >> stream
BT/ 36 Tf((Hello World!))' ET endstream endobj
3 0 obj << /Type /Page /Parent 2 0 R /Contents 4 0 R >> endobj
2 0 obj << /Kids [3 0 R ] /Count 1 /Type /Pages /MediaBox [ -195 -442 400 400 ] >> endobj
1 0 obj << /Pages 2 0 R /Type /Catalog >> endobj
trailer << /Root 1 0 R >>
%%EOF

Courtesy of Thomas; see "Create Memorystream of type pdf and return to browser".

If you are "hand balling" with UTF-16 characters on a "small device", it becomes a step harder; see https://stackoverflow.com/a/68442444/10802527

More useful for producing your own: many Raspberry Pi users compile PDFs via MuTool Create, https://mupdf.com/docs/manual-mutool-create.html

The input text to be translated during compilation is much simpler, especially for image handling:

%%MediaBox 0 0 612 792
%%Font TmRm Times-Roman
%%Font Helv-C Helvetica Cyrillic
%%Font Helv-G Helvetica Greek
%%Image I0 logo/ClientLogo.png

% Draw the image.
q
480 0 0 480 50 250 cm
/I0 Do
Q

% Draw a triangle. (Can be rectangles or a grid etc)
q
1 0 0 rg
50 50 m
100 200 l
200 50 l
f
Q

% Show some text. (Remember we humans work downwards, so 50 in then 760,730,700, etc. downwards)
q
0 0 1 rg
BT /TmRm 24 Tf 50 760 Td (Hello, from ESP32!) Tj ET
BT /Helv-C 24 Tf 50 730 Td <fac4d2c1d7d3d4d7d5cad4c521> Tj ET
BT /Helv-G 24 Tf 50 700 Td ( I am Line 3) Tj ET
Q
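
The hex string in the Helv-C line is just the text's bytes written as hexadecimal digits. As a small sketch (Python, with a hypothetical helper `hex_string`): those particular bytes decode as KOI8-R, which I am assuming matches the Cyrillic encoding selected on the `%%Font` line; check the MuTool manual before relying on that.

```python
def hex_string(text: str, codec: str) -> str:
    """Encode text with a single-byte codec and wrap it as a PDF hex string."""
    return "<" + text.encode(codec).hex() + ">"


# The Russian greeting "Здравствуйте!" ("Hello!") encoded as KOI8-R.
print(hex_string("Здравствуйте!", "koi8_r"))
# prints <fac4d2c1d7d3d4d7d5cad4c521>
```

Going the other way, `bytes.fromhex("fac4d2c1d7d3d4d7d5cad4c521").decode("koi8_r")` recovers the original text.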
K J
  • Thanks a lot, it is magnificent! Yes, I'm "hand balling" with Arduino boards and ESP32, and the .pdf report was the last step of designing a portable temperature datalogger for medical transportation. By the way, I don't know why I got 2 downvotes. Probably grammar? Was it a bad question? Anyway, maybe you can give some advice. I think a method for making PDF reports from plain text could be very useful, since .pdf documents are a bit more complicated to modify than .txt files or Excel spreadsheets. Anyway, thanks a lot, I was looking for something like this without success. THANKS! – Diego Estrada Aug 23 '21 at 14:38
  • Thanks. I removed the images, which reduced the code from 5k lines to about 1k. I found out that I can append text to a table like "BT /F2 6 Tf 224 749 Td (24/08/2021 13:24:00 13.55 14.66 13.44)Tj ( )Tj ET Q q". That way I'm adding a new line to a pre-defined table without corrupting the file or modifying the xref trailer, so I can just add data to the report until I run out of space on the page. Not complete, but I'm heading in the right direction. Thanks a lot. – Diego Estrada Aug 24 '21 at 16:06
  • Yes, I'm using the Courier font, which makes it easier; now I only deal with the coordinates of each line. With a "for loop" I subtract the height of each line until it gets to the bottom of the page, then with an "if condition" I start over right next to it at the top of the page and restart the "for loop" until I run out of space. I'm using a 2-page template and small fonts, so I can add a lot of data without creating a new page. – Diego Estrada Aug 24 '21 at 16:17
  • Wow, each comment is gold, thanks KJ!! I will search for more info on SVG images – Diego Estrada Aug 24 '21 at 16:20
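
The row-placement loop described in these comments (step down by one line height, then jump to the top of the next column when the bottom is reached) might look roughly like this. It is a sketch in Python rather than Arduino code; the function names and all default coordinates are illustrative assumptions, not values taken from the actual template.

```python
def escape(text: str) -> str:
    """Backslash-escape the characters that are special in a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def layout_rows(rows, font="/F2", size=6, left=36, top=749, bottom=40,
                line_height=8, column_width=280, page_width=612):
    """Return one text-showing line per row, filling columns top to bottom."""
    ops = []
    x, y = left, top
    for row in rows:
        if y < bottom:                    # current column is full
            x, y = x + column_width, top  # continue at the top of the next one
            if x + column_width > page_width:
                raise ValueError("page is full")
        ops.append(f"BT {font} {size} Tf {x} {y} Td ({escape(row)}) Tj ET")
        y -= line_height
    return ops


rows = [f"24/08/2021 13:{minute:02d}:00 13.55 14.66 13.44" for minute in range(3)]
for line in layout_rows(rows):
    print(line)
```

Strictly speaking, appending these lines makes the content stream longer, so its /Length and the offsets of any later objects change. Some viewers repair such files when opening them, but reserving the space in the template or recomputing the values is the safer route.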