I am trying to understand the overall structure of a PDF file. To start, I would like to know how to extract the text a PDF contains using only Python's standard library. I have found a good resource on the structure of a PDF file, but right now it's way out of my league: PDF Documentation
For practice, I have created a PDF that contains only the text "Hello World", as shown in the image below.
How can I find this from the binary data of the pdf alone? Starting from here:
with open('Hello World.pdf', 'rb') as f:
    data = f.read()
How can I locate the "Hello World" text? I wish I could include the data here, but there are too many characters.
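To make the question concrete, here is a minimal sketch of the kind of approach I have in mind, using only `re` and `zlib`. It rests on several assumptions that may not hold for my file: that the page content sits in a `stream ... endstream` block, that the block is either uncompressed or Flate (zlib) compressed, and that the text appears as a literal string in parentheses before a `Tj` or `TJ` text-showing operator. The function name `find_text` is my own invention, not part of any library. PDFs that use hex strings, font-specific glyph codes, nested parentheses, or other filters would likely need more work.

```python
import re
import zlib

# Raw bytes between each "stream" keyword and the following "endstream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)
# Operand of a text-showing operator: a literal string before Tj,
# or an array of strings and spacing numbers before TJ.
SHOW_RE = re.compile(
    rb"(\((?:\\.|[^\\()])*\)|\[(?:\\.|[^\]\\])*\])\s*(?:Tj|TJ)", re.DOTALL
)
# A single literal string such as (Hello World), allowing backslash escapes.
LITERAL_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)
# Escaped parentheses and backslashes inside a literal string.
ESCAPE_RE = re.compile(rb"\\([()\\])")


def find_text(data):
    """Return a list of strings shown by Tj/TJ operators in the PDF bytes."""
    found = []
    for match in STREAM_RE.finditer(data):
        raw = match.group(1)
        try:
            # decompressobj ignores the end-of-line bytes after the data.
            content = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Assumption: anything that is not zlib data is uncompressed.
            content = raw
        for shown in SHOW_RE.finditer(content):
            pieces = LITERAL_RE.findall(shown.group(1))
            text = b"".join(ESCAPE_RE.sub(rb"\1", p) for p in pieces)
            if text:
                # Latin-1 maps every byte to a character, so this never fails,
                # but the result is only meaningful for simple encodings.
                found.append(text.decode("latin-1"))
    return found
```

With the `data` read above, I would call `find_text(data)` and hope to see the "Hello World" string in the returned list, but I do not know whether these assumptions match what my PDF writer actually produced.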