
I am trying to understand the overall structure of a PDF file. To start, I would like to know how to parse a PDF for text that it contains using only Python's standard library. I have found a good resource here for the structure of a PDF file, but right now, it's way out of my league: PDF Documentation

For practice, I have created a PDF that contains only the text, "Hello World" as shown in the image below.

[Image: the PDF page containing only the text "Hello World"]

How can I find this from the binary data of the pdf alone? Starting from here:

with open('Hello World.pdf', 'rb') as f:
    data = f.read()

How can I locate the "Hello World" text? I wish I could include the data here, but there are too many characters.

Gabe Morris
  • *"To start, I would like to know how to parse a PDF for text that it contains"* - please be aware that this is one of the most complicated facets of PDF content processing if done properly. (There are some ways to cheat for some simple documents, but a generic solution is complex.) – mkl Jun 23 '21 at 06:13
  • I think your comment is exactly what I'm looking for. So the stream data is the binary containing the "Hello World!" text? How can I decode that, since it's not UTF-8, correct? How do I map from the hex to the actual characters? These are the questions I'm looking to answer. @KJ – Gabe Morris Jun 23 '21 at 17:24
  • Do you think you can post an answer with the code you used? How did you decrypt it? The PDF files that I plan on parsing are all structured the exact same, so I'm curious as to how to find certain objects. I'm thinking about applying a regex, but I don't know how to do that with binary strings. But I have zero experience when it comes to decoding hex values into actual characters. Could you shed some light on that for me in an answer? Thanks! @KJ – Gabe Morris Jun 23 '21 at 20:01
  • I do recognize the PDF format for string objects from what you decrypted. If you look at the pdf specs from the link I provided, it makes sense. If I can get to where you’ve gotten then I’m a happy person. @KJ – Gabe Morris Jun 23 '21 at 20:12

2 Answers


You might want to try a library such as PyPDF2 or tika:

from tika import parser # pip install tika

raw = parser.from_file('hello_world.pdf')
print(raw['content'])

For more information, see "How to extract text from a PDF file?"

Jonathan Coletti
  • I would like to accomplish this with only the standard library like I said in the post. – Gabe Morris Jun 23 '21 at 02:35
  • @GabeMorris then check out the source code for something like tika.... I was just trying to show some libraries so you can look at them. Your question is asking someone to write a full python module... – Jonathan Coletti Jun 23 '21 at 02:36

In a comment you asked how to easily build and analyse a basic PDF. I used a simple input technique elsewhere to build an example, and I am repeating it here since that answer was recently deleted!

It includes an A4 image (595 pixels wide), so just omit that first "% Set" block if you do not need the image.

 %% is a global definition
 % is a comment
 0,0 is page bottom Left (x,y)

One Page.Txt (it's not exactly the same as the picture, since I have now corrected the proper names :-)

%%MediaBox 0 0 595 842
%%Font Helv Helvetica
%%Image I0 background.png

% Set the Background image.
q
595 0 0 842 0 0 cm
/I0 Do
Q

% Add text.
q
0 0 1 rg
BT /Helv 18 Tf 50 805 Td (Hello, World!) Tj ET
BT /Helv 18 Tf 50 777 Td (Hello, Moon!) Tj ET
Q
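
As an aside on the regex question from the comments: the text-showing lines in a content stream like the one above can be read back with only the standard library, because `re` works on bytes as long as the pattern is bytes too. What follows is a minimal sketch, not a general extractor. It assumes the stream is uncompressed, that text is written as literal strings `(...)` followed by `Tj`, and that the font uses a simple single-byte encoding (decoded here as Latin-1). The names `TJ_PATTERN`, `unescape` and `show_text_strings` are made up for this illustration.

```python
import re

# A PDF literal string followed by the Tj (show text) operator.
# Simplification: nested unescaped parentheses, hex strings <...>,
# and the TJ array operator are not handled.
TJ_PATTERN = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")

# Single-character escapes only; octal escapes such as \101 are ignored here.
ESCAPES = {
    b"n": b"\n", b"r": b"\r", b"t": b"\t", b"b": b"\b", b"f": b"\f",
    b"(": b"(", b")": b")", b"\\": b"\\",
}


def unescape(raw):
    """Replace backslash escapes inside a literal string with the real bytes."""
    return re.sub(rb"\\(.)", lambda m: ESCAPES.get(m.group(1), m.group(1)), raw)


def show_text_strings(content):
    """Return the strings passed to Tj, in the order they appear."""
    return [unescape(m.group(1)).decode("latin-1")
            for m in TJ_PATTERN.finditer(content)]


# The text block from the page description above, as bytes.
sample = b"""q
0 0 1 rg
BT /Helv 18 Tf 50 805 Td (Hello, World!) Tj ET
BT /Helv 18 Tf 50 777 Td (Hello, Moon!) Tj ET
Q
"""

print(show_text_strings(sample))
```

Files that embed subsetted fonts often store glyph codes rather than readable characters, in which case a pattern like this finds the strings but not the text.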

Run Artifex MuPDF MuTool (cross platform AGPL)

> mutool create -o sample.pdf Page.Txt

[Image: the resulting sample.pdf]

You were more interested in how I was able to extract the exact content of the resulting PDF. Remember that any extraction from a PDF is just a re-expression of the original file content, and the objects can come out in a different order. It is NOT analysis or reverse engineering; it is a reimaging.

For that task I used MuPDF-GL.exe on Windows to open the file and, using A (Save new PDF):

  1. Removed encryption and
  2. Switched output to Pretty print + Ascii (AKA Plain text), then saved as a new text based PDF
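
If you would rather stay inside Python's standard library than re-save the file, something similar can be sketched for the common case where the page content is only Flate-compressed. The sketch below is an assumption-laden illustration, not what MuPDF does: it presumes the file is not encrypted, that streams use at most the FlateDecode filter, and that no stream data contains the word `endstream`. `STREAM_PATTERN` and `iter_streams` are names invented here.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of each object.
STREAM_PATTERN = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)


def iter_streams(data):
    """Yield each stream body, inflated when it is valid zlib data."""
    for match in STREAM_PATTERN.finditer(data):
        raw = match.group(1)
        try:
            yield zlib.decompress(raw)
        except zlib.error:
            # Not Flate data (or another filter is in use): pass it through.
            yield raw


# Self-contained demo: a fake one-object "PDF" built in memory.
content = b"BT /Helv 18 Tf 50 805 Td (Hello, World!) Tj ET"
compressed = zlib.compress(content)
fake_pdf = (
    b"1 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(compressed)
    + compressed
    + b"\nendstream\nendobj\n"
)

for body in iter_streams(fake_pdf):
    if b"Tj" in body:  # crude check for a text-showing operator
        print(body.decode("latin-1"))
```

For a real file, pass in the bytes read with `open('Hello World.pdf', 'rb')`. The printed streams then look much like the pretty-printed output described above, although the objects may appear in a different order.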
K J