
Where do I find information about how a PDF is made up? For example, a PDF I created named Dokname, containing the string TEST, looks like this when opened in a text editor (I replaced the parts the text editor couldn't decode with [...]):

%PDF-1.4
%Óëéá
1 0 obj
<</Title (Dokname)
/Producer (Skia/PDF m102 Google Docs Renderer)>>
endobj
3 0 obj
<</ca 1
/BM /Normal>>
endobj
5 0 obj
<</Filter /FlateDecode
/Length 160>> stream
[...]
endstream
endobj
2 0 obj
<</Type /Page
/Resources <</ProcSet [/PDF /Text /ImageB /ImageC /ImageI]
/ExtGState <</G3 3 0 R>>
/Font <</F4 4 0 R>>>>
/MediaBox [0 0 596 842]
/Contents 5 0 R
/StructParents 0
/Parent 6 0 R>>
endobj
6 0 obj
<</Type /Pages
/Count 1
/Kids [2 0 R]>>
endobj
7 0 obj
<</Type /Catalog
/Pages 6 0 R>>
endobj
8 0 obj
<</Length1 14972
/Filter /FlateDecode
/Length 7164>> stream
[...]
endstream
endobj
9 0 obj
<</Type /FontDescriptor
/FontName /AAAAAA+ArialMT
/Flags 4
/Ascent 905.27344
/Descent -211.91406
/StemV 45.898438
/CapHeight 715.82031
/ItalicAngle 0
/FontBBox [-664.55078 -324.70703 2000 1005.85938]
/FontFile2 8 0 R>>
endobj
10 0 obj
<</Type /Font
/FontDescriptor 9 0 R
/BaseFont /AAAAAA+ArialMT
/Subtype /CIDFontType2
/CIDToGIDMap /Identity
/CIDSystemInfo <</Registry (Adobe)
/Ordering (Identity)
/Supplement 0>>
/W [0 [750] 40 54 666.99219 55 [610.83984]]
/DW 0>>
endobj
11 0 obj
<</Filter /FlateDecode
/Length 243>> stream
[...]
endstream
endobj
4 0 obj
<</Type /Font
/Subtype /Type0
/BaseFont /AAAAAA+ArialMT
/Encoding /Identity-H
/DescendantFonts [10 0 R]
/ToUnicode 11 0 R>>
endobj
xref
0 12
0000000000 65535 f 
0000000015 00000 n 
0000000365 00000 n 
0000000098 00000 n 
0000008721 00000 n 
0000000135 00000 n 
0000000573 00000 n 
0000000628 00000 n 
0000000675 00000 n 
0000007925 00000 n 
0000008159 00000 n 
0000008407 00000 n 
trailer
<</Size 12
/Root 7 0 R
/Info 1 0 R>>
startxref
8860
%%EOF

What do these obj-elements represent? Where is my TEST? Why did it get scrambled?

What I am searching for can probably all be found in Adobe's documentation, but that runs to hundreds of pages, which is very overwhelming. I get that this is a very complex topic and I am not trying to understand it completely. I am just looking for an introduction or an overview. Unfortunately I didn't find anything like that on YouTube or elsewhere.

  • Does this answer your question? [PDF specifications for coders: Adobe or ISO?](https://stackoverflow.com/questions/14111831/pdf-specifications-for-coders-adobe-or-iso) – Troy Turley Apr 13 '22 at 19:45

1 Answer


This is too complex for a comment, and yes, you will only find snippets here and there, including this one and bits in my and others' answers.

Here is a quick overview of the code sample you provided.

A PDF is a collection of objects that need not appear in sequential order. So you start at the end, just before the last %%EOF (potentially one of many!), with startxref 8860, where 8860 is the decimal byte offset of the cross-reference (xref) table, i.e. the file's index.
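As a minimal sketch of that first step, the snippet below looks for the last startxref keyword near the end of a byte string. The function name `find_startxref` and the 1024-byte search window are my own assumptions for illustration, and the sample bytes are just the tail of the file shown in the question.

```python
import re

def find_startxref(pdf_bytes: bytes) -> int:
    """Return the offset written after the last 'startxref' keyword."""
    # Only the end of the file matters; the window size is an arbitrary choice.
    tail = pdf_bytes[-1024:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref found near the end of the file")
    # An updated file may hold several; the last one is the current index.
    return int(matches[-1])

# Tail of the file from the question.
sample_tail = (
    b"trailer\n<</Size 12\n/Root 7 0 R\n/Info 1 0 R>>\n"
    b"startxref\n8860\n%%EOF\n"
)

xref_offset = find_startxref(sample_tail)
print(xref_offset)  # 8860
```

For a real file you would read it with `open(path, "rb").read()` first; a text-mode read would corrupt the binary streams and shift the offsets.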

There are many abbreviations (too many to list) and, as in a stack language, most things may appear (literally) backwards. The xref table gives the position of each object in the file.
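To make the index idea concrete, here is a rough sketch that turns the xref section from the question into a mapping from object number to byte offset. `parse_xref` is a hypothetical helper, and it assumes a single subsection of the classic text form (no xref streams, no incremental updates).

```python
XREF_SECTION = """\
xref
0 12
0000000000 65535 f
0000000015 00000 n
0000000365 00000 n
0000000098 00000 n
0000008721 00000 n
0000000135 00000 n
0000000573 00000 n
0000000628 00000 n
0000000675 00000 n
0000007925 00000 n
0000008159 00000 n
0000008407 00000 n
"""

def parse_xref(section: str) -> dict:
    """Map object number -> byte offset for in-use ('n') entries."""
    lines = section.strip().splitlines()
    if lines[0].strip() != "xref":
        raise ValueError("section does not start with 'xref'")
    # Subsection header: first object number, then the entry count.
    first, count = (int(part) for part in lines[1].split())
    offsets = {}
    for index, line in enumerate(lines[2:2 + count]):
        offset, _generation, kind = line.split()
        if kind == "n":  # 'f' marks a free entry, such as object 0
            offsets[first + index] = int(offset)
    return offsets

offsets = parse_xref(XREF_SECTION)
print(offsets[7])  # 628, where "7 0 obj" (the catalog) starts
print(offsets[4])  # 8721, the last object written in the file
```

Note how the offsets are not in object order: object 4 is listed fifth in the table but sits at the very end of the file.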

The prime target in this case is 7 0 obj <</Type /Catalog /Pages 6 0 R>> endobj, which the trailer names as /Root. The catalog tells us where the page tree is found, here object 6 with /Type /Pages /Count 1 /Kids [2 0 R]. So there is one page, further defined in 2 0 obj.
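The chain trailer, catalog, page tree, page can be followed mechanically. The sketch below does so with regular expressions over a trimmed copy of the objects from the question; it is only an illustration (a real parser has to tokenize properly and handle nested page trees), and the helper name `ref` is hypothetical.

```python
import re

# Trimmed copy of the relevant objects from the question.
SAMPLE = """\
2 0 obj
<</Type /Page
/MediaBox [0 0 596 842]
/Contents 5 0 R
/Parent 6 0 R>>
endobj
6 0 obj
<</Type /Pages
/Count 1
/Kids [2 0 R]>>
endobj
7 0 obj
<</Type /Catalog
/Pages 6 0 R>>
endobj
trailer
<</Size 12
/Root 7 0 R
/Info 1 0 R>>
"""

# Object number -> dictionary text between "obj" and "endobj".
objects = {
    int(number): body
    for number, body in re.findall(
        r"(\d+) 0 obj\s*(.*?)\s*endobj", SAMPLE, re.S
    )
}

def ref(body: str, key: str) -> int:
    """Return the object number of an indirect reference like '/Key 6 0 R'."""
    match = re.search(rf"/{key}\s+(\d+)\s+\d+\s+R", body)
    if match is None:
        raise KeyError(key)
    return int(match.group(1))

trailer = SAMPLE[SAMPLE.rindex("trailer"):]
root = ref(trailer, "Root")            # catalog
pages = ref(objects[root], "Pages")    # page tree root
kids_text = re.search(r"/Kids\s*\[(.*?)\]", objects[pages]).group(1)
kids = [int(n) for n in re.findall(r"(\d+)\s+\d+\s+R", kids_text)]
contents = [ref(objects[kid], "Contents") for kid in kids]

print(root, pages, kids, contents)  # 7 6 [2] [5]
```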

We now see that the page references font(s) and image-related entries, all placed within /MediaBox [0 0 596 842]. The units are points (1/72 inch), so this is roughly a standard A4 page, just a tad wider, since 595/72 inch is closer to 210 mm.
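The page size arithmetic is easy to check, taking 1 point as 1/72 inch and 1 inch as 25.4 mm. The helper name `points_to_mm` is just for illustration.

```python
def points_to_mm(points: float) -> float:
    """Convert PDF user-space units (1/72 inch by default) to millimetres."""
    return points / 72 * 25.4

width_mm = points_to_mm(596)
height_mm = points_to_mm(842)

print(f"{width_mm:.2f} x {height_mm:.2f} mm")  # 210.26 x 297.04 mm
print(f"{points_to_mm(595):.2f} mm")           # 209.90 mm
```

A4 is 210 x 297 mm, so a width of 596 overshoots by about a quarter of a millimetre, while 595 falls about a tenth short.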

There is too much to describe about that one item alone, so skipping to "Where is your text?": we see /Contents 5 0 R, so that compressed stream, which you need to decode, most likely holds your text. Note that /Length 160 is the size of the binary Flate-encoded stream, and that stream contains placement operators too, not just your raw plain text.
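A /FlateDecode stream is zlib-compressed data, so Python's `zlib` module can decode it. Since the real bytes of object 5 were elided in the question, the sketch below compresses a made-up content stream and decodes it again. The operators and the hex string are assumptions: the glyph numbers 55, 40, 54 (hex 37, 28, 36) are taken from the /W array of the font in the question, and reading them as T, E, S is my guess, not something the file confirms.

```python
import zlib

# Hypothetical content stream; the real object 5 was not shown.
# With /Encoding /Identity-H each glyph is a 2-byte number, so text
# shows up as glyph IDs in hex rather than as readable letters.
raw = (
    b"BT\n"
    b"/F4 12 Tf\n"                  # select font resource F4 at 12 pt
    b"72 770 Td\n"                  # move to a text position
    b"<0037002800360037> Tj\n"      # show four glyphs
    b"ET\n"
)

# Stand-in for the bytes between 'stream' and 'endstream'.
compressed = zlib.compress(raw)

def decode_flate(stream_bytes: bytes) -> bytes:
    """Undo /Filter /FlateDecode (zlib format) without predictors."""
    return zlib.decompress(stream_bytes)

decoded = decode_flate(compressed)
print(decoded.decode("latin-1"))
print(b"TEST" in decoded)  # False
```

With a real file you would slice exactly /Length bytes after the stream keyword's line ending and pass those to `decode_flate`. The letters themselves are then recovered through the font's /ToUnicode map (object 11 here), which is why a plain search for TEST finds nothing.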

The quantity of data for subsetting the font seems odd and excessive for just 4 letters (if it were the similar Helvetica, the font would not need to be included, nor broken out as CID ArialMT). Without the full file it's hard to say why the /Image* entries are there, but it is the Google Docs Renderer!

My suspicion is we may see characteristics of OCR in that stream.

K J