I am working on software to store legal documents, and I think PDF might be an ideal format to work in. However, I am unsure which specific PDF variant would best suit my needs.

I have the following requirements for the documents:

  • will be stored for a minimum of 7 years if not longer
  • not editable
  • contain both images and text (images will be in .jpg format ideally)

I originally looked at PDF/A-1, but I have discovered that this format does not seem to accept JPEG images, or at least it doesn't when the file is produced with JODConverter.

Any suggestions/explanations as to which format would best meet these needs would be much appreciated!

Matthew Pigram
  • Does your software need to convert legacy file formats into PDF/A? Or are you free to work from scratch and set up a system where only newly created documents will need to be archived through your software? – Kurt Pfeifle Aug 20 '12 at 07:45
  • Some older documents will need to be converted into PDF format, but I'm only looking at supporting whatever Open Office supports. Most docs from 7 years ago should (hopefully) convert into PDF easily enough, since people were already using MS Word back then. Only documents that are still required to be kept (which is 7 years) will need to be converted to PDF format. – Matthew Pigram Aug 22 '12 at 06:42

2 Answers

For the requirements you described, PDF/A-1b (yes, b at the end!) is the ideal format. The b is for basic -- it has less strict requirements to meet than PDF/A-1a (a at the end), which is for accessible (or advanced, as my own mnemonic has it).

If you have no difficulty implementing PDF/A-1a, you may as well go for it. However, depending on your source documents, PDF/A-1a may be extremely difficult, if not nearly impossible, to generate (as it requires additional tagging of the file's content for the accessibility features).

As for JPEG: of course PDF/A-1b supports JPEGs. It does not allow JPEG2000 compression to be used, because that algorithm was patent encumbered at the time of defining the PDF/A-1b standard. PDF/A-1b generating software therefore must re-compress objects using this type of compression with one of the other methods (which does not pose a big practical problem, though).
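
To see which kind of image compression a given PDF actually uses, a crude check is to look for the stream filter names in the raw file: `/DCTDecode` marks ordinary JPEG data and `/JPXDecode` marks JPEG2000. The following is only a rough sketch, not a validator. The function name `count_filters` is made up for illustration, and the byte scan will miss filters that sit inside compressed object streams (a feature of newer PDF versions).

```python
import re

# Filter names as they appear in PDF stream dictionaries.
# DCTDecode = baseline JPEG, JPXDecode = JPEG2000.
FILTER_NAMES = (b"DCTDecode", b"JPXDecode")


def count_filters(pdf_bytes: bytes) -> dict:
    """Count how often each filter name occurs in the raw PDF bytes.

    Hypothetical helper: a plain byte scan, not a real PDF parser.
    """
    counts = {}
    for name in FILTER_NAMES:
        # Match the name token (e.g. "/DCTDecode") but not longer names
        # that merely start with the same characters.
        pattern = rb"/" + name + rb"(?![A-Za-z0-9])"
        counts[name.decode("ascii")] = len(re.findall(pattern, pdf_bytes))
    return counts
```

Reading a file with `open(path, "rb").read()` and passing the bytes to `count_filters` gives a quick hint: a non-zero `JPXDecode` count means the images would have to be re-compressed before the file can pass as PDF/A-1b.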

You may also want to look at the PDF/A Competence Center (PDFA) website. (Disclosure: I'm a member of the PDFA.)

Kurt Pfeifle
  • Are you familiar with JODConverter? I can't find any mention of how to convert my PDFs to the A-1b spec. Thanks for the info though, really good answer to my question! – Matthew Pigram Aug 20 '12 at 06:48
  • @Matthew Pigram: Since JODConverter relies on LibreOffice or OpenOffice: both have the capability to export to PDF/A-1a. However, for this to succeed (so that the end result *really* passes a PDF/A-1a validator) your source ODT documents need to be properly formatted using good templates (so that the later tagging of the PDF works accordingly). – Kurt Pfeifle Aug 20 '12 at 08:05
  • Is PDF/A just a set of rules for creating a PDF file, or is it a strict specification separate from PDF 1.7? If I create a PDF in accordance with the list of restrictions, will it be PDF/A or not? – Ruben Kazumov Aug 21 '12 at 05:50
  • @user1543083: A 'PDF/A' specification doesn't exist per se -- it's either 'PDF/A-1b' or 'PDF/A-1a' (not talking about the 'PDF/A-2*' even...). And both are based on the general PDF-1.4 spec. They do *limit* features (and even forbid some) which are allowed in the general PDF-1.4 spec -- and they do forbid all new features that came in PDF-1.5, PDF-1.6 and PDF-1.7! – Kurt Pfeifle Aug 21 '12 at 07:22
  • @user1543083: A PDF/A-1b also requires a certain flag to be present in the PDF metadata declaring itself as PDF/A-1b. Otherwise it may meet all criteria, but wouldn't be recognized as such a file by viewers. – Kurt Pfeifle Aug 21 '12 at 07:25
  • Hi, Kurt! Is there a standard specification for the file structure, flags, etc.? – Ruben Kazumov Aug 21 '12 at 22:37
  • @user1543083: I don't understand what you mean. Be more specific please. And follow the links I provided. You'll also have to do your own research... – Kurt Pfeifle Aug 21 '12 at 23:17
  • I think I have figured out why my code is breaking. After each image is scanned in from the scanner, I write it out as its own PDF document, and then in the code I append those documents together to create one multi-page document. Since I have converted the individual pages as PDF/A-1, attempting to append them together is most likely a big "no no", since the PDF/A-1 format is designed purely for reading and not writing. – Matthew Pigram Aug 23 '12 at 04:52
  • @MatthewPigram: Do you have access to the official PDF/A-1b specification? If not, you can't hope to get it right "by accident"... If you make 2 (dummy) samples of your "PDF/A-1b" available to me, I offer to scrutinize them and tell you which of the spec's items are not met: give me one of the individual PDFs (before appending), and one of the appended ones. – Kurt Pfeifle Aug 23 '12 at 07:05
  • @KurtPfeifle I do have access to those documents. I have checked them and tested with just a single JPG image, and it doesn't work when being converted to PDF/A-1. As far as I can tell, there is no way through code to produce a PDF/A-1b with the library I'm using. It's probably the library's fault and nothing to do with the spec, so I probably can't do anything to get it going... – Matthew Pigram Aug 24 '12 at 01:34
  • @MatthewPigram: I offered to analyze one of your samples and tell you which PDF/A-1b criteria aren't met... – Kurt Pfeifle Aug 24 '12 at 08:27
  • @KurtPfeifle OK, well, what would you want: the JPG image that I'm converting to a PDF, or the completed PDF 1.6? – Matthew Pigram Aug 27 '12 at 00:48
  • @MatthewPigram: as I said: *'...2 (dummy) samples of your "PDF/A-1b"...'*. Because you had mentioned *'...I have converted the individual pages as PDF/A-1...'*. – Kurt Pfeifle Aug 27 '12 at 06:48
  • @KurtPfeifle I have since discovered that attempting to convert even a single JPG image into a PDF/A-1b fails miserably, so those samples do not exist. The strange thing is that if I place the very same images into a Word document and convert that to PDF/A-1b, it works perfectly fine... – Matthew Pigram Aug 29 '12 at 01:51
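
The self-declaration flag mentioned in the comments above lives in the file's XMP metadata, as the properties `pdfaid:part` and `pdfaid:conformance`. Below is a minimal sketch of how one might read that claim out of a file. The function names are invented for illustration, and the sketch assumes the XMP packet is stored uncompressed so that a plain byte search can find it. Note that it only reports what the file *claims* to be; only a real PDF/A validator can tell whether the claim holds.

```python
import re


def _find_pdfaid(prop: str, data: bytes):
    """Look up one pdfaid property, in element or attribute form."""
    p = re.escape(prop.encode("ascii"))
    # Element form: <pdfaid:part>1</pdfaid:part>
    element = rb"<pdfaid:" + p + rb">\s*([^<\s]+)\s*</pdfaid:" + p + rb">"
    # Attribute form: pdfaid:part="1"
    attribute = rb"pdfaid:" + p + rb"\s*=\s*[\"']([^\"']+)[\"']"
    match = re.search(element, data) or re.search(attribute, data)
    return match.group(1).decode("ascii", "replace") if match else None


def pdfa_claim(pdf_bytes: bytes):
    """Return (part, conformance), e.g. ("1", "B"), or None if undeclared.

    Hypothetical helper: reads the self-declared flag only, no validation.
    """
    part = _find_pdfaid("part", pdf_bytes)
    conformance = _find_pdfaid("conformance", pdf_bytes)
    if part is None or conformance is None:
        return None
    return part, conformance
```

Running this on one of the individual pages and on the appended document would show whether the flag is present at all, and whether it survives the append step.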
PDF/A-1 is a good format for long-term storage (that is its purpose), so it tries to remove external dependencies. This includes things like embedding fonts and DISABLING external hyperlinks (which also makes sense, but can be a gotcha). Some useful info is on the Adobe site (look at the key-specifications tab). PDF sounds like the right answer to your requirements.

Embedding the images should not be a problem. Perhaps JODConverter is doing something wrong (or the version of OpenOffice/LibreOffice you are using underneath it). You could try swapping out parts of that underlying infrastructure (OO/LO), or experiment directly from the OpenOffice/LibreOffice GUI: export as PDF/A-1 and see what the results are. You could also try some other tools in the chain (e.g. Docmosis, though that is based on similar technology).
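
Besides the GUI, the export can also be tried from the command line, which takes JODConverter out of the picture entirely. The sketch below only builds the command. It assumes a recent LibreOffice whose `--convert-to` option accepts filter options in JSON form, with `SelectPdfVersion` set to `1` for PDF/A-1; older releases and OpenOffice may not understand this syntax. The function name and the `soffice` executable name are assumptions to adjust for your setup.

```python
import json


def pdfa1_export_command(source: str, outdir: str, soffice: str = "soffice") -> list:
    """Build a headless LibreOffice command line for a PDF/A-1 export.

    Hypothetical helper; assumes the JSON filter-option syntax of newer
    LibreOffice versions, where SelectPdfVersion=1 selects PDF/A-1.
    """
    options = {"SelectPdfVersion": {"type": "long", "value": "1"}}
    filter_spec = "pdf:writer_pdf_Export:" + json.dumps(options, separators=(",", ":"))
    return [
        soffice,
        "--headless",
        "--convert-to", filter_spec,
        "--outdir", outdir,
        source,
    ]
```

Passing the returned list to `subprocess.run(..., check=True)` performs the conversion. If the same document exports fine this way but fails through JODConverter, the problem is more likely in the converter layer than in the PDF/A-1 format.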

Paul Jowett