Not sure where else to ask this, so I figured I'd give good old stackoverflow a shot.

Let's say, by chance, I would like to write a library or set of libraries that will create PDFs and convert files to PDF, AND I couldn't care less about how long it will take me to complete (3 months - 10 years.. whatever). I have absolutely no interest in paying for a toolkit... the point of this would be to learn how to manipulate and create files like PDFs. There's nothing business critical about the project, I just want to learn how to do it. Where do I start? I would imagine something like this would be written in C++, but I'm not sure... maybe high-level languages would work as well. I'm not looking for someone to tell me exactly how to do it, but to send me in the right direction, or at least point out the concepts I would need to concretely grasp before proceeding with such a project.

Any advice and help in directing me here is greatly appreciated : )

wakurth
  • Good luck: http://www.adobe.com/devnet/pdf/pdf_reference.html. As far as language: if the purpose of this is to learn the PDF format, go with whatever language is most comfortable. A high-level language would probably be a good idea: that way, you'd more likely spend your time actually doing PDF parsing, not wrangling the language. – John Calsbeek Feb 11 '12 at 05:17

1 Answer

Well, you will need a very good understanding of the PDF file format. Adobe publishes the standard and you can start at their site. You can start with the base 1.7 standard and then read the cumulative supplements from there. It is a daunting task, but it can be done and you can pretty much use any language you want, because in the end you are just generating bytes that can be saved to a file.
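To make the "just generating bytes" point concrete, here is a minimal sketch in Python that assembles a one-page PDF by hand. The function name `build_minimal_pdf`, the Letter-sized page, the Helvetica font, and the text position are all choices made for illustration; it assumes the text fits in Latin-1 and is nowhere near a full implementation of the standard.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Assemble a one-page PDF that shows `text` in 24 pt Helvetica."""
    # Backslashes and parentheses are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Content stream: begin text, pick font, move to (72, 720), show text, end text.
    # Assumption: the text is Latin-1 encodable (no Unicode handling here).
    content = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")

    # Object bodies, numbered 1..5 in this order.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.7\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, needed by xref
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: one fixed-width 20-byte entry per object.
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    # Trailer names the root object and where the xref table starts.
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)
```

Writing the result with something like `open("hello.pdf", "wb").write(build_minimal_pdf("Hello, PDF"))` should give a file a viewer can open. Most of the work is bookkeeping: every object's byte offset has to be recorded exactly, which is why the output is built as bytes rather than as text.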

If you want to convert from, let's say, Word documents, it will get a little trickier. Microsoft has published their file formats, which you would have to learn, and then you would have to learn how to translate them into the corresponding PDF formatting. Also note that .doc and .docx are completely separate file formats and would require separate engines to convert them.
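As a rough illustration of how different the two formats are, even the first few bytes differ: a legacy .doc is stored in an OLE2 compound file, while a .docx is a ZIP archive of XML parts. The sketch below only sniffs those signatures; the function name `sniff_word_format` is made up for this example, and other formats share the same containers (an .xls is also OLE2, an .xlsx is also a ZIP), so a match is a hint and not proof.

```python
# Signature of an OLE2 compound file, the container used by legacy .doc.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header signature of a ZIP archive, the container used by .docx.
ZIP_MAGIC = b"PK\x03\x04"


def sniff_word_format(data: bytes) -> str:
    """Guess the container format from the leading bytes of a file."""
    if data.startswith(OLE2_MAGIC):
        return "doc"      # OLE2 container; could also be .xls, .ppt, ...
    if data.startswith(ZIP_MAGIC):
        return "docx"     # ZIP container; could also be .xlsx, any .zip, ...
    return "unknown"
```

A converter would use something like this only to pick which engine to hand the file to; the real work is parsing what is inside each container.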

With unlimited time, it is definitely doable; you would just need to ask yourself whether it is worth it.

John Koerner
  • Thanks for the insight!!! I can't tell you how many times I've run into issues where I needed to do file conversion (PDF-WORD, WORD-PDF, TIFF-PDF, JPG-PDF, and so on) and, as you probably know, most file conversion programs that convert between PDF and many different file formats cost thousands and thousands of dollars... and for good reason, after glancing over the links I've seen in this thread so far. I love learning new concepts, and this is something I've always wanted to at least touch on... – wakurth Feb 12 '12 at 05:53
  • Follow-up question for you, John... Do you know of any direct ways to convert PDF to XPS that would be safe to use on the backend of a web server? PDF to XPS and XPS to PDF is actually the conversion I'm most interested in (I just forgot to specify that above)... I would think this conversion may be a bit easier, but I'm not sure. Thanks again, John. – wakurth Feb 12 '12 at 05:56
  • hrmm.. seems like this may be a temporary option http://stackoverflow.com/questions/1002476/best-ways-to-convert-xps-to-pdf-and-vice-versa ... while I figure out how to do it myself. – wakurth Feb 12 '12 at 06:36