I'm reading PDF specs and I have a few questions about the structure it has.

First of all, the file signature is %PDF-n.m (8 bytes). After that, the docs say there might be at least 4 bytes of binary data (but there also might not be any). The docs don't say how many binary bytes there could be, so that is my first question. If I were trying to parse a PDF file, how should I parse that part? How would I know how many binary bytes (if any) were placed there? Where should I stop parsing?

After that, there should be a body, an xref table, a trailer, and an %%EOF.

What could be the minimal file size of a PDF, assuming there isn't anything at all (no objects, whatsoever) in the PDF file and assuming the file doesn't contain the optional binary bytes section at the beginning?

Third and last question: If there were more than one body+xref+trailer section, where would the offset just before the %%EOF point? To the first or the last xref table?

alexandernst
  • The second line cannot be any arbitrary 'binary data' - it's just a comment line. With that said: you can parse it as any random comment line. – Jongware Jan 19 '16 at 21:21
  • @Jongware No, as far as I can read in the specs, those are actually at least 4 completely random bytes. – alexandernst Jan 19 '16 at 21:23
  • Not *completely* random - it should still be parseable! Adobe's own guide (I don't have the ISO-32000 on my iPad) says in 3.4.1 "a comment line containing at least four binary characters". Note the 'comment'; imagine your 1st character is a LF! (They also use 'binary' as a synonym of a character code >128 :P What they clearly mean is "with the highest bit set".) – Jongware Jan 19 '16 at 21:28
  • @Jongware Ah, yes, that's what I was also thinking of. Anyways, how would I know how many binary chars a PDF has? Is there any "breakpoint" I should be looking for? – alexandernst Jan 19 '16 at 21:30
  • Oh... I just realized... With a value > 128... I should just search for the first byte that has a < 128 value. That answers my first question I guess :) – alexandernst Jan 19 '16 at 21:31
  • They recommend a single comment line, right after the mandatory header line. A comment line starts with `%` and ends at a CR or LF. But there can be any number of comments in a PDF (all may or may not contain 'binary' characters; they all should end with a CR or LF). – Jongware Jan 19 '16 at 21:32

1 Answer

First of all, the file signature is %PDF-n.m (8 bytes). After that, the docs say there might be at least 4 bytes of binary data (but there also might not be any). The docs don't say how many binary bytes there could be, so that is my first question. If I were trying to parse a PDF file, how should I parse that part? How would I know how many binary bytes (if any) were placed there? Where should I stop parsing?

Which docs do you have? The PDF specification ISO 32000-1 says:

If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be immediately followed by a comment line containing at least four binary characters—that is, characters whose codes are 128 or greater.

Thus, those four or more bytes of binary data do not follow the file signature as unstructured bytes; they sit on a comment line! This implies that they are

  1. preceded by a % (which starts a comment, i.e. data you have to ignore while parsing anyways) and
  2. followed by an end-of-line, i.e. CR, LF, or CR LF.

So it is easy to recognize while parsing. In particular, it is merely a special case of a comment line and needs no special treatment.
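As a rough illustration of the point above, here is a minimal Python sketch that reads the header line and then treats the second line as an ordinary comment. The names `read_line` and `parse_header` are made up for this example, and it assumes the header sits at byte 0 of the data; a real parser would need to be more tolerant.

```python
import re
from typing import Tuple

HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")
EOL_RE = re.compile(rb"\r\n|\r|\n")  # an end-of-line is CR LF, CR, or LF


def read_line(data: bytes, pos: int) -> Tuple[bytes, int]:
    """Return (line without its EOL, position just after the EOL)."""
    m = EOL_RE.search(data, pos)
    if m is None:
        return data[pos:], len(data)
    return data[pos:m.start()], m.end()


def parse_header(data: bytes) -> Tuple[str, bool, int]:
    """Return (version, has_binary_marker, offset where parsing continues).

    Assumption for this sketch: the header starts at byte 0.
    """
    line, pos = read_line(data, 0)
    m = HEADER_RE.match(line)
    if m is None:
        raise ValueError("missing %PDF-n.m header")
    version = m.group(1).decode() + "." + m.group(2).decode()

    # The "binary bytes" are just the content of a comment line:
    # it starts with '%' and runs to the next end-of-line.
    has_marker = False
    second, after_second = read_line(data, pos)
    if second.startswith(b"%"):
        has_marker = sum(b >= 128 for b in second[1:]) >= 4
        pos = after_second  # skip the comment like any other comment
    return version, has_marker, pos
```

The number of binary bytes never has to be known in advance: the comment simply ends at the next CR or LF, whatever its length.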

(sigh, I just saw that you and @Jongware cleared that up in the comments while I wrote this...)

What could be the minimal file size of a PDF, assuming there isn't anything at all (no objects, whatsoever) in the PDF file and assuming the file doesn't contain the optional binary bytes section at the beginning?

If there are no objects, you don't have a PDF file as certain objects are required in a PDF file, in particular the catalog. So do you mean a minimal valid PDF file?

As you commented you indeed mean a minimal valid PDF.

Please have a look at the question What is the smallest possible valid PDF? on Stack Overflow; it contains several attempts to create minimal PDFs that adhere more or less strictly to the specification. Reading e.g. @plinth's answer, you will see stuff that is not PDF anymore but is still accepted by Adobe Reader.
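To give a feeling for what "certain objects are required" means in practice, here is a hedged Python sketch that writes a small one-page file with a catalog, a page tree, and an empty page. It makes no claim to be the smallest possible PDF; the function name `build_small_pdf` and the choice of objects are assumptions of this example, and a strict validator may still want more.

```python
def build_small_pdf() -> bytes:
    """Build a small PDF with a catalog, a page tree, and one empty page."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << >> >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    xref_at = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    # Each cross reference entry is exactly 20 bytes including its EOL.
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_at
    return bytes(out)
```

Writing the result to disk with `open("small.pdf", "wb").write(build_small_pdf())` gives a file whose size is dominated by the cross reference table and trailer rather than by content.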

Third and last question: If there were more than one body+xref+trailer section, where would the offset just before the %%EOF point?

Normally it would be the last cross reference table/stream as the usual use case is

  • you start with a PDF which has but one cross reference section;
  • you append an incremental update with a cross reference section pointing to the original as previous, and the new offset before %%EOF points to that new cross reference;
  • you append yet another incremental update with a cross reference section pointing to the cross references from the first update as previous, and the new offset before %%EOF points to that newest cross reference;
  • etc...
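For the incremental-update case just described, a reader only needs the offset in front of the final %%EOF. A minimal Python sketch, assuming the file is fully loaded into memory; the name `last_startxref` and the 1024-byte window are choices made for this example only:

```python
import re

STARTXREF_RE = re.compile(rb"startxref\s+(\d+)\s+%%EOF")


def last_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the offset given by the last 'startxref' before the final %%EOF.

    Only the tail of the file is searched; earlier startxref/%%EOF pairs
    left behind by older revisions are ignored.
    """
    tail = data[-tail_size:]
    matches = list(STARTXREF_RE.finditer(tail))
    if not matches:
        raise ValueError("no startxref/%%EOF pair found near end of file")
    return int(matches[-1].group(1))
```

In a file with two incremental updates there are three such pairs, and this returns the number from the newest one.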

The exception is the case of linearized documents, in which the offset before the %%EOF points to the initial cross references, which in turn point to the section at the end of the file as previous. For details, cf. Annex F of ISO 32000-1.

And as you can of course apply incremental updates to a linearized document, you can have mixed forms.

In general, it is best for a parser to be able to handle partial cross references in any order. And don't forget that cross references can come not only as cross reference sections (tables) but alternatively as cross reference streams.
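One way to stay independent of the order is to follow the "previous" links rather than assume a layout. The Python sketch below does this for classic cross reference tables only, reading the /Prev entry from each trailer; it does not handle cross reference streams, and it finds /Prev with a simple regular expression instead of really parsing the trailer dictionary. The name `xref_chain` is hypothetical.

```python
import re
from typing import List

PREV_RE = re.compile(rb"/Prev\s+(\d+)")


def xref_chain(data: bytes, start: int) -> List[int]:
    """Return the offsets of all classic xref sections, newest first.

    `start` is the offset found before the final %%EOF. Each trailer may
    name an older section via /Prev; wherever that points, we follow it.
    """
    offsets = []
    seen = set()  # guards against /Prev loops in broken files
    offset = start
    while offset is not None and offset not in seen:
        seen.add(offset)
        if not data.startswith(b"xref", offset):
            # Could be a cross reference stream, which this sketch skips.
            raise ValueError("no classic xref table at offset %d" % offset)
        offsets.append(offset)

        trailer_at = data.find(b"trailer", offset)
        if trailer_at < 0:
            break
        end = data.find(b"startxref", trailer_at)
        if end < 0:
            break
        m = PREV_RE.search(data, trailer_at, end)
        offset = int(m.group(1)) if m else None
    return offsets
```

Because the loop only follows offsets, it behaves the same whether the previous section lies earlier in the file (incremental updates) or later (the linearized case).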

mkl
  • %I'm not so sure OP understood it *must* be an otherwise perfectly regular comment... á§₩ü - here, a couple of 'binary' characters to show they *are* valid inside a comment. And it starts with `%` as well. – Jongware Jan 19 '16 at 21:35
  • Thank you for clarifying! The first part of the question should be covered now. – alexandernst Jan 19 '16 at 21:44
  • `So do you mean a minimal valid PDF file?` -> Yes, I mean the minimal possible content of a PDF file. `The exception is the case of linearized documents` -> Can you give me some more information about that case? – alexandernst Jan 19 '16 at 21:46
  • Ok, so there is only the minimal size question left. I also have another question, but that one is a whole other full-sized question, so I'll post it separately. – alexandernst Jan 19 '16 at 22:12
  • I posted my next question: http://stackoverflow.com/questions/34888029/pdf-files-and-flash-readers – alexandernst Jan 19 '16 at 22:19