I have written the following "Hello" code for a PDF file.

%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Parent 2 0 R /Contents 4 0 R >>
endobj
4 0 obj
<< /Length 20 >>
stream
BT
/F1 40 Tf
100 600 Td
(Hello!) Tj
ET
endstream
endobj
trailer
<<  /Root 1 0 R
    /Size 3
>>
%%EOF

How is the xref table calculated?

UPDATE AFTER THIRD COMMENT:

Can I write a table as below?

xref
0 3
0000000000 65536 f
0000000001 00000 n
0000000002 00000 n
0000000003 00000 n

What is wrong with that, if anything?

The example on this page shows that consecutive offsets in the xref table differ by more than 1. However, it is not clear why the first object has offset 15 and the second object has offset 87. How are these numbers calculated?

mahmood
  • You mention ISO 32000. How about simply creating the cross-reference table as described there? In contrast to some details of PDFs, the specification of cross-reference tables therein is pretty straightforward and easy to understand. That being said, your file, even with a cross-reference table, will be missing the page media box and the page resources (including the font **F1** you reference). – mkl Mar 29 '19 at 16:29
  • Section 7.5.4 explains xref. But I would like to know how the column values are calculated. I can not understand that! – mahmood Mar 29 '19 at 17:01
  • What exactly do you not understand? For **n** type entries you have first a "10-digit number, padded with leading zeros if necessary, giving the number of bytes from the beginning of the file to the beginning of the object", then a space, then a "5-digit generation number" (and in your PDF text all generation values are 0), then a space, then an "n", then a "2-character end-of-line sequence". – mkl Mar 29 '19 at 19:02
  • The question is not about how many columns there are in xref! I want to know how the offsets (first column) are calculated. See the updated post. – mahmood Mar 30 '19 at 17:30
  • Let me quote from my previous comment: "the number of bytes from the beginning of the file to the beginning of the object" – mkl Mar 31 '19 at 05:09

1 Answer

After the edit of the question the problem became clear: you don't know the units in which the offset is measured.

The n entry in an xref table is described as

The format of an in-use entry shall be:

nnnnnnnnnn ggggg n eol

where:

nnnnnnnnnn shall be a 10-digit byte offset in the decoded stream

ggggg shall be a 5-digit generation number

n shall be a keyword identifying this as an in-use entry

eol shall be a 2-character end-of-line sequence

"10-digit byte offset in the decoded stream" might be a bit unclear. Fortunately, the text above is immediately followed by an explanation:

The byte offset in the decoded stream shall be a 10-digit number, padded with leading zeros if necessary, giving the number of bytes from the beginning of the file to the beginning of the object.

(ISO 32000-1, section 7.5.4 "Cross-Reference Table")

Thus, the offset here is simply the byte position in the PDF at which the object starts with its object and generation number (for example, the position of the "1" in "1 0 obj").
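To make the byte counting concrete, here is a minimal Python sketch. It assumes the file is written with single LF line endings and that the objects follow the header directly; the helper name `object_offsets` is made up for this illustration and is not part of any PDF library. The offsets change if you use CR LF line endings or add further lines before the first object.

```python
def object_offsets(header: bytes, objects: list[bytes]) -> list[int]:
    """Return the byte offset of each object when the objects are
    written one after another, directly after the header."""
    offsets = []
    position = len(header)        # the first object starts right after the header
    for obj in objects:
        offsets.append(position)  # offset = bytes from start of file to "N G obj"
        position += len(obj)      # the next object starts after this one ends
    return offsets


# Assumption: LF-only line endings, no binary comment line after the header.
header = b"%PDF-1.4\n"
objects = [
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
    b"2 0 obj\n<< /Kids [3 0 R] /Count 1 >>\nendobj\n",
]

offsets = object_offsets(header, objects)
data = header + b"".join(objects)

for offset in offsets:
    # Each offset points at the first byte of the "N G obj" line.
    print(offset, data[offset:offset + 7])
```

Counting bytes rather than characters matters: the offsets must be computed on the data exactly as it is stored on disk.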


As an aside, one item you must also follow strictly is the length of such an entry:

the overall length of the entry shall always be exactly 20 bytes.
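As a rough sketch of that 20-byte rule, the following Python snippet formats entries with a space plus LF as the 2-character end-of-line sequence (CR LF without the extra space would be another valid choice). The function names `xref_entry` and `xref_table` are hypothetical helpers for this illustration, and the offsets passed in at the end are example values that assume LF-only line endings in the file.

```python
def xref_entry(offset: int, generation: int = 0, in_use: bool = True) -> bytes:
    """Format one cross-reference entry: 10-digit offset, space,
    5-digit generation, space, keyword, 2-character end of line."""
    keyword = b"n" if in_use else b"f"
    entry = b"%010d %05d %s \n" % (offset, generation, keyword)
    if len(entry) != 20:
        raise ValueError("xref entry must be exactly 20 bytes")
    return entry


def xref_table(offsets: list[int]) -> bytes:
    """Build a single-section xref table for objects 1..len(offsets)."""
    parts = [b"xref\n", b"0 %d\n" % (len(offsets) + 1)]
    # Object 0 is the head of the free list and has generation 65535.
    parts.append(xref_entry(0, 65535, in_use=False))
    parts.extend(xref_entry(offset) for offset in offsets)
    return b"".join(parts)


# Example offsets only; real values depend on the bytes actually written.
print(xref_table([9, 58]).decode("ascii"))
```

Note that the subsection header counts object 0 as well, so a table covering objects 1 and 2 starts with "0 3".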

mkl
  • If you look at the tutorial I pasted, 15-0 has no relation with 20 bytes no matter what radix you use. – mahmood Mar 31 '19 at 21:24
  • I say that the first line is `%PDF-1.4\n`, which is 10 characters, and that means 10 bytes. So the first object starts at 0000000010. Right? – mahmood Mar 31 '19 at 21:26
  • It might be 9 or 10 bytes, depending on the line ending you chose. Thus, your first object starts at offset 9 or 10. It is common, though, to have a comment in the second line with some non-ASCII characters and the first object starting in the third line. Because of that, your tutorial has the first object at offset 15. – mkl Apr 01 '19 at 04:55
  • *"If you look at the tutorial I pasted, 15-0 has no relation with 20 bytes no matter what radix you use"* - no one talked about '15-0'. The entry line is "0000000015 00000 n", which including the chosen line ending must be 20 bytes long. – mkl Apr 01 '19 at 04:59
  • @mahmood In the light of the clarifying comments here, does my answer appropriately answer your question? Or are there still open issues? – mkl Apr 03 '19 at 12:34
  • Do you confirm the following offsets? `0000000011 0000000056 0000000104` for the first three objects? – mahmood Apr 04 '19 at 17:32
  • Without the file itself (not just the file pasted as text into the question but the definitive binary file) I cannot confirm anything. – mkl Apr 04 '19 at 19:17
  • I have updated the code, which is working on my computer. Just paste it into Notepad and save it as "sample.pdf". As you can see, it works even without `xref`. But I think it is then outside the standard. What do you mean by the binary file? – mahmood Apr 05 '19 at 17:50
  • I don't see it working. Adobe reader displays an empty page, and when closing the document again, it asks whether it should save the file. Adobe reader does so only if it had applied changes to the document. As your pdf does not contain form fields which might have been filled in, this indicates that Adobe reader first had to repair the document in memory to display something. – mkl Apr 06 '19 at 06:14
  • *"What do you mean by the binary file"* - I mean the file exactly as you have it stored, without the need to transform some copied and pasted text into a file (a process that might result in very different files, depending on things like character encoding, line ending preferences, etc.). – mkl Apr 06 '19 at 06:18
  • Actually I was able to open that text file in Foxit. As I said, I saved the text file as sample.pdf and Foxit was able to open it. Can you send a sample file from what you are explaining? – mahmood Apr 06 '19 at 19:53
  • *"Actually I was able to open that text file in Foxit."* - That a file can be opened in a single PDF viewer and that it therein looks like you expect, hardly qualifies as a check for document validity. PDF viewers have a long history of trying to repair broken input files while loading them into memory. Your PDF **is** broken for numerous reasons (missing xref, missing font resource definition, miscellaneous missing types, ...). Probably the missing xref was allowed in Adobe's PDF 1.1 but nowadays PDF is specified by an ISO norm, so those ancient precursors are irrelevant. – mkl Apr 08 '19 at 12:50
  • You saved me!! "eol shall be a 2-character end-of-line sequence": I only had a one-char EOL, and Adobe Reader was complaining about something entirely different. – wormsparty Mar 17 '21 at 08:02
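
The line-ending point raised in the comments can be checked with a few lines of Python. This is only an illustration: the binary comment bytes below are an assumed example of the "comment with some non-ASCII characters" that many PDF writers emit, not the actual content of the tutorial file.

```python
# Header line with the two common line-ending choices.
lf_header = b"%PDF-1.4\n"      # LF only
crlf_header = b"%PDF-1.4\r\n"  # CR LF

# Assumed example of a binary comment line: '%', four non-ASCII bytes, LF.
binary_comment = b"%\xe2\xe3\xcf\xd3\n"

print(len(lf_header))                   # first object offset with LF endings
print(len(crlf_header))                 # first object offset with CR LF endings
print(len(lf_header + binary_comment))  # first object offset with a comment line
```

With LF endings and a four-byte binary comment, the first object would land at offset 15, which is one plausible way a file arrives at that value.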