I want to write an application that can read and decode a PDF document. Where am I supposed to get the specs for this file format? The PDF format is standardized by ISO, but it's not clear to me which source is the most reliable for this kind of information.

What is a good source to start with for this file format?

user1824407
  • Depending on what you need, maybe you can grab an existing library to do the work for you? – Thomas Jan 01 '13 at 16:01
  • @Thomas do you know a C/C++ lib that is tiny and compact? There are many horrible libraries... – user1824407 Jan 01 '13 at 16:03
  • No first-hand experience, sorry. But I reckon it'll take you at least a few months of fulltime work to do better. – Thomas Jan 01 '13 at 16:04
  • What platform are you developing for? If MacOSX or iOS this is trivial, otherwise, as already suggested, this is a lot of work. – marko Jan 01 '13 at 16:16
  • @Marko my point is not to target a platform but to target PDF itself. I was planning to code this in standard, portable C++ or ANSI C. – user1824407 Jan 01 '13 at 16:24
  • In response to your original question: use the ISO 32000-1:2008 standard as reference and keep an eye on the development of ISO 32000-2. If you want to do Adobe specific development, additionally have a look at the Adobe supplements to the norm. – mkl Jan 01 '13 at 19:57
  • @all_those_who_vote_to_close_question_as_not_constructive (Daniel Fischer, Flexo, C. Ross, Bo Persson, Matthew Lundberg): this question was a very valid one, and closing it was in fact very unconstructive. And the justification you picked for it is ridiculous. None of you has so far collected any significant reputation points with [pdf] -- so why do you not leave this question to people who know the topic better than you?!? – Kurt Pfeifle Jan 02 '13 at 02:52
  • @KurtPfeifle "Where do I get the specs" is not a specific programming problem. So it's off topic on SO. Not all valid questions are on topic here. (That the question was closed as Not Constructive doesn't mean all of us picked that reason. However, shopping list questions - and this can also be viewed as one without too much of a stretch - are generally considered NC, so it's not ridiculous, though IMO not the best reason.) – Daniel Fischer Jan 04 '13 at 05:10
  • @DanielFischer: SO is also for *newbie* developers. And even the most senior C# or Haskell expert may get tasked for the first time with creating software that has to handle PDF documents. So where else but SO should he turn to ask such a question? Please? – Kurt Pfeifle Jan 04 '13 at 11:03
  • @DanielFischer: Also, this question does not match any of the following exclusion criteria from the FAQ: (a) *chatty, open-ended question*; (b) *unpractical, unanswerable question*; (c) *question is not about a problem the programmer faces*. -- OTOH, it meets the following inclusion criteria: (a) *practical, answerable problem unique to the programming profession*; (b) *(software) tool commonly used by (PDF) programmers*. DanielFischer: please re-consider your vote. – Kurt Pfeifle Jan 04 '13 at 11:11
  • @DanielFischer: re. your point about *'shopping list'*... It's not that there are dozens of different versions of the PDF file format specification by different competing vendors, ya know? The question was about two different versions of the 'same thing' (which in the case of PDF may indeed be very confusing to newbie PDF programmers). – Kurt Pfeifle Jan 04 '13 at 11:14
  • @KurtPfeifle I don't follow. Newbieness has nothing to do with it. "Where do I get the specs" is off topic on SO, regardless of whether it's the PDF specification, the C standard, or the Haskell report (or whatever). So if I can, I post a helpful comment, and vote to close. – Daniel Fischer Jan 04 '13 at 11:15
  • @KurtPfeifle There's only one specification (I think), but there may be dozens of places where one may get it. (But I agree, it's not a question I would consider a shopping list question.) However, getting the specification is not a _programming_ problem (those start soon after). – Daniel Fischer Jan 04 '13 at 11:21
  • @DanielFischer: Well, (even if I would accept your argument) in this case you didn't post a helpful comment but voted to close it anyway. :-( – Kurt Pfeifle Jan 04 '13 at 11:21
  • @KurtPfeifle a) I don't know where to get the PDF spec, b) David van Driessche had already answered that. – Daniel Fischer Jan 04 '13 at 11:22
  • @DanielFischer: In a previous comment you rate the question as a *"shopping list question without too much of a stretch"*. In your last comment you say *"it's not a question I would consider a shopping list question"*. Ok, I know now clearly what your arguments are... – Kurt Pfeifle Jan 04 '13 at 11:25
  • @KurtPfeifle But convince me. What is so good about this question that it needs to be open on SO? – Daniel Fischer Jan 04 '13 at 11:25
  • @KurtPfeifle _without too much of a stretch_ means I see how one can consider it such, not that I do consider it such. – Daniel Fischer Jan 04 '13 at 11:26
  • @DanielFischer: Yes, David van Driessche (who is a real PDF guru, BTW) has answered the question. But he could only do so while the question was not yet closed... – Kurt Pfeifle Jan 04 '13 at 11:26
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/22176/discussion-between-daniel-fischer-and-kurt-pfeifle) – Daniel Fischer Jan 04 '13 at 11:28
  • I thought this was a really good question. Yes, SO is about development questions, but the person asking this question was actually smart enough to inquire about the framework around his development rather than blundering into things and asking 10 stupid questions on SO because they didn't think before starting. Let's not forget that development starts by doing your research; I would actually award extra points to someone who thinks before doing. Questions about the fundamentals of your development definitely qualify as development questions for me... Oh, and it's a *smart* question! – David van Driessche Jan 05 '13 at 13:42

3 Answers

You can actually use both sources you mentioned; the confusion is historical.

Adobe invented PDF and it invented the Acrobat product family to be used together with it. The different PDF versions were released together with major Acrobat versions (PDF 1.3 for example was released together with Acrobat 4).

Because of the adoption of the PDF format and because a number of ISO standards were written that actually depended on the proprietary PDF file format (not an easy thing for an ISO standard), Adobe decided to hand over the PDF format to ISO.

From that point on and until today there is an ISO committee responsible for editing the PDF specification and coming up with new versions. The ISO standard for PDF is ISO 32000.

Also, keep in mind that, depending on where you want to use PDF, a number of other ISO standards might be very useful or indispensable. Amongst the most commonly used are PDF/X (for exchange of PDF files in the publishing community) and PDF/A (for the creation of PDF files that need to be archived in long-term storage). These specifications reference a specific version of the PDF standard and add additional requirements and restrictions.
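
Because derived standards such as PDF/X and PDF/A are tied to a specific PDF version, a useful first step when reading a file is to check which version it declares. A PDF file begins with a header comment of the form `%PDF-1.7`. The following Python snippet is only a minimal sketch of that check; the function name `declared_pdf_version` and the 1024-byte search window are illustrative assumptions, not something taken from the specification text quoted here.

```python
import re

# Header comment pattern: "%PDF-" followed by major.minor, e.g. "%PDF-1.7".
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def declared_pdf_version(data: bytes):
    """Return (major, minor) from the header comment, or None if not found.

    Hypothetical helper: it searches only the first 1024 bytes, an assumed
    window chosen to tolerate a little leading junk before the header.
    """
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n<< >>\nendobj\n"
    print(declared_pdf_version(sample))
```

Treat the result as a hint only: a document catalog can carry its own version entry that takes precedence over the header, so a real reader would also have to parse the object structure.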

As far as the specification is concerned, you can get all documents from the ISO directly. However, for PDF itself you can also get it from Adobe and that document will be identical. Refer to the Adobe DevNet site on Acrobat:

http://www.adobe.com/devnet/acrobat.html

Just download the Acrobat SDK and that will give you the documentation as part of it.

Let me add a word of caution on "targeting the PDF specification" in code. I really, really, really advise you to more clearly specify exactly what your needs are for PDF (editing, generating, quality control (preflight)) and then look for or ask about an existing library that meets those needs or can be extended to meet your needs.

Writing something that supports "PDF" in general will be a daunting task. The PDF specification is large, intricate and full of... well... niceties. There be dragons!


Update:

Direct link to Adobe's PDF-1.7 specification document (first edition, free to download) is here:

The content of this document later became officially adopted as the ISO standard for general PDF, ISO 32000-1.

Note however, that there are a few differences to the PDF file available from ISO:

If you are starting to develop PDF software, it is sufficient to have the (free) PDF from the Adobe link above at hand.


Update: 2021

It's worth noting that ISO has meanwhile released a new version of the PDF specification, called ISO 32000-2. Information about it is available on the ISO site. This new version was published in 2017 and received an update in December 2020.

While the document does not dramatically alter PDF, and most of the general information about PDF from, for example, the free Adobe version of the specification will still be correct, there are definitely changes:

  • Many things, especially deeply technical things such as everything on transparency, received an update, mostly to clarify existing language (and add information that was up to now more or less implicit). These updates may have an effect on how to implement or use those parts of the standard.
  • New features have been included in the standard.

If you're writing PDF files, especially simpler ones, the Adobe specification should still be OK to get you going. If you want to support everything in the PDF standard, you'll need to pay for the latest ISO version (but that is a tall order anyway).

Update: 2023

As of April 2023, the PDF 2.0 ISO standard is available for no cost for everyone, thanks to some generous sponsors. This also includes ISO-approved errata and new PDF 2.0 cryptographic extensions. See https://www.pdfa.org/announcing-no-cost-access-to-iso-32000-2-pdf-2-0/ to get your own copy.

ISO 32000-2 is the first PDF specification entirely developed in a vendor-neutral, consensus-based forum. Many corrections and clarifications were made that can help every PDF user to ensure reliability and interoperability.

David van Driessche
  • The link to the free Adobe spec is broken. This one currently works: https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf Unfortunately, I cannot edit the post because the Stack Overflow queue for edits on this post is full. – Steffen Langer Jan 20 '23 at 12:34

As of today (April 2023) the PDF 2.0 ISO standard is available for no cost for everyone, thanks to some generous sponsors. This also includes ISO-approved errata and new PDF 2.0 cryptographic extensions. See https://www.pdfa.org/announcing-no-cost-access-to-iso-32000-2-pdf-2-0/ to get your own copy.

ISO 32000-2 is the first PDF specification entirely developed in a vendor-neutral, consensus-based forum. Many corrections and clarifications were made that can help every PDF user to ensure reliability and interoperability, so please stop using legacy versions that are now well over a decade old.

pwyatt

PDF is not a lightweight format. It is basically PostScript with compression on top. You definitely want to use an existing library rather than write your own. It's a huge task.

Or get an existing PDF writer application, and start it from within your program.

I haven't looked at it very much, but libgnupdf looks OK.

According to Wikipedia, PDF combines three technologies:

  • A subset of the PostScript page description programming language, for generating the layout and graphics.
  • A font-embedding/replacement system to allow fonts to travel with the documents.
  • A structured storage system to bundle these elements and any associated content into a single file, with data compression where appropriate.
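
As a rough illustration of the "structured storage" point: a PDF file ends with a `startxref` keyword, the byte offset of its cross-reference data, and an `%%EOF` marker, and a reader uses that offset to locate the objects in the file. The Python sketch below only extracts that offset. The helper name `find_startxref` and the 1024-byte tail window are assumptions made for illustration; it deliberately ignores incremental updates, cross-reference streams, and damaged files.

```python
def find_startxref(data: bytes):
    """Return the offset recorded after the last 'startxref', or None.

    Hypothetical helper: it looks only at the last 1024 bytes, an assumed
    window that is enough for well-formed files ending in %%EOF.
    """
    keyword = b"startxref"
    tail = data[-1024:]
    pos = tail.rfind(keyword)
    if pos == -1:
        return None
    # The offset is the first whitespace-separated token after the keyword.
    tokens = tail[pos + len(keyword):].split()
    if not tokens or not tokens[0].isdigit():
        return None
    return int(tokens[0])


if __name__ == "__main__":
    sample = b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\nstartxref\n1234\n%%EOF\n"
    print(find_startxref(sample))
```

Getting from this offset to actual page content still requires parsing the cross-reference data, the trailer, and the object graph, which is where most of the work in a PDF reader lies.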
Palec
Mats Petersson
  • As far as I know, the PDF file format can contain a lot of other things that are not really PostScript-like, such as bitmap images and videos. PS is basically a language for printers that was redirected to monitors, but I think that PDF is more like a container. – user1824407 Jan 01 '13 at 16:28
  • Also, gnupdf is GPLv3, so it's useless. – user1824407 Jan 01 '13 at 16:29
  • If you don't want an open source library, then perhaps this would work: http://www.adobe.com/devnet/pdf/library.html I've just updated the answer itself with information regarding "it is not postscript". If you think that's wrong, then please point me to a more accurate provider of information. – Mats Petersson Jan 01 '13 at 16:35
  • The problem is not that it is open source (actually that's a +1); the problem is the GPL and its viral license. – user1824407 Jan 01 '13 at 16:38
  • Ok, there are others around... http://pdf-house.blogspot.co.uk/ seems to have a decent list. – Mats Petersson Jan 01 '13 at 16:50
  • There are actually HUGE differences between PostScript and PDF. Calling PDF "PostScript with compression on top" is not giving much credit to either language. The biggest difference perhaps is that PostScript is a true programming language, while PDF is not. That is an enormous difference and explains why PDF is the format used today (there are clearly other reasons too) while PostScript is going away. – David van Driessche Jan 01 '13 at 23:04
  • Perhaps sloppy use of language - it is BASED on PostScript, and you need a pretty good part of a PostScript interpreter to implement PDF rendering - at least that is my understanding, if you can explain it better, feel free to write a better answer. – Mats Petersson Jan 01 '13 at 23:06
  • No, you actually don't. The problem with PostScript is that it is in fact a programming language and you need code that will execute the program (that every PostScript file is) and allow it to generate its output. PDF is much, much simpler and contains only very simple instructions like "move text origin", "draw rectangle", "set fill color", "set font"... It's really a different animal altogether. (And I just finished writing a background answer on the specification that should help him :)) – David van Driessche Jan 01 '13 at 23:15
  • OK, fair enough. Wikipedia has it wrong. But we both agree on the fact that it's a pretty poor idea to try to implement your own PDF code, which is the main point of what I write. – Mats Petersson Jan 01 '13 at 23:19
  • @MatsPetersson: Please do not answer a technical question on StackOverflow by (badly) quoting or referring to Wikipedia. Nothing good in favor of your reputation can come from this... I'd even like to ask you to delete your above answer -- it is so utterly wrong and can lead people down the wrong path! – Kurt Pfeifle Jan 02 '13 at 02:44