Ghostscript as PDF Validator: Command and Results

Question

Two scholars said they used Ghostscript to validate PDFs. Their cryptic explanation of technique: "To make Ghostscript work as a validator, we simply converted the PDF files to 'None'." In a slideshow, they added that "None" was "a dummy result, no real output," and that converting to None "prints out found errors."

I would like to use Ghostscript in a similar manner, and would also like to learn a little about Ghostscript in the process, for future applications. My review of the Ghostscript documentation and of a previous StackOverflow answer has led me to try this (using Ghostscript Portable 9.50 in a Windows 7 virtual machine):

gswin64c.exe -o /dev/null -dNODISPLAY "C:\PDFs\Badfile.pdf" > "C:\Results.txt"

I welcome suggestions on whether that is the best command for the purpose. My questions here have to do with what Results.txt says about Badfile.pdf. Here are the contents of Results.txt:

GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
   **** Error:  An error occurred while reading an XREF table.
   **** The file has been damaged.  This may have been caused
   **** by a problem while converting or transfering the file.
   **** Ghostscript will attempt to recover the data.
   **** However, the output may be incorrect.
   No pages will be processed (FirstPage > LastPage).

   **** This file had errors that were repaired or ignored.
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

   **** The rendered output from this file may be incorrect.

My questions:

(1) Should I interpret this output as saying that the XREF table problem is the only problem in this file, or may there be other unspecified problems? If the latter, can I modify the command to obtain a more specific indication of what Results.txt means, when it reports that Badfile.pdf "does not conform to Adobe's published PDF specification"?

(2) "The file has been damaged. This may have been caused by a problem while converting or transfering the file." Is this suggesting that, for some flagged PDFs, the problems identified by Ghostscript may be due to Ghostscript itself?

(3) "Ghostscript will attempt to recover the data. ... This file had errors that were repaired or ignored." Can I assume that the operative word is "ignored" -- that, as in the procedure used by those two scholars, Ghostscript is not really attempting to recover data, and my command is producing "no real output"?

(4) For some purposes, I may want output in a one-line summary form. For instance, the JHOVE PDF validator's audit option can produce a line containing filename, MD5 hash, and a statement of whether the PDF file is valid. Given the scholars' finding that JHOVE has problems, it would be helpful if I could put Ghostscript's findings into a spreadsheet for comparison.

I realize Ghostscript may not have all this, and I appreciate what I already have from it. But if I am missing anything, I'd like to know. Thank you for any light you can shed.

Not an answer but some potential insight. Internally, a PDF file is like a book of chapters (each chapter being a text or graphics object etc) with an index. The book analogy breaks down because unlike a real-world book, the chapters can be in random order, or orphaned - only the index tells you the sequence of the valid chapters. Therefore, if the index (aka your XREF?) is broken then there is no way for a PDF reader (Ghostscript) to navigate and validate the content (chapters). So I would expect that your 'damaged' message means Ghostscript gave up and did not validate the further content. — Vanquished Wombat, Jan 15 '20 at 11:28
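
To make the "index" idea above concrete, here is a minimal Python sketch that checks only whether the `startxref` offset near the end of a PDF points at something that looks like cross-reference data. The function name `xref_pointer_status` and its return strings are made up for illustration. It does not parse the table, and it is not how Ghostscript itself does its checking.

```python
import re


def xref_pointer_status(pdf_bytes):
    """Say whether the last startxref offset points at cross-reference data."""
    # A PDF normally ends with: startxref <byte offset> %%EOF
    pos = pdf_bytes.rfind(b"startxref")
    if pos == -1:
        return "no startxref keyword"
    match = re.match(rb"startxref\s+(\d+)", pdf_bytes[pos:])
    if match is None:
        return "startxref has no offset"
    offset = int(match.group(1))
    target = pdf_bytes[offset:offset + 32]
    # Classic files have the keyword "xref" at that offset.
    if target.startswith(b"xref"):
        return "points at xref table"
    # Newer files may store the index as an object (a cross-reference stream).
    if re.match(rb"\d+\s+\d+\s+obj", target):
        return "points at an object (possibly an xref stream)"
    return "offset does not point at xref data"
```

A file whose offset was shifted, for example by CR/LF translation during transfer, would typically land in the last case.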

score 3 · Answer 1 · edited Dec 03 '22 at 19:19

The xref error is the first problem. GS attempts to fix that and continue. However the next error (FirstPage > LastPage) suggests it was unable to resolve the problem.

This is always going to be a problem; attempting to recover from a fault in the file might mean ignoring something important (or misinterpreting it) which leads the next object to error, and so on in a cascade.

Ghostscript isn't intended as a validation tool. While we have been reasonably diligent recently in flagging problems, earlier code might simply silently ignore them. In addition, it was felt that repeated warnings were pointless, annoying, and made it hard to see the real fault, so many errors are only reported once, no matter how many occurrences there are.

So to answer your questions:

(1) No, this may not be the only error; it's just the first one encountered. There are no more verbose error messages. You can use -dPDFDEBUG, which dumps what the interpreter is up to as it goes and will localise some kinds of problems. The 'does not conform' text is just boilerplate for 'something bad happened', in case there isn't a better error.
(2) No, this isn't suggesting Ghostscript broke it. It's giving two common reasons for PDF files to be broken: transferring the file via a non-binary mechanism (e.g. email, or one which does CR/LF translation), or editing the file.
(3) It seems you don't know what's wrong with your file? I can't see any reason why you would assume GS is ignoring the error. In the case of an xref problem it absolutely will not be ignoring it; it tried to fix it. Sadly, the 'fixed' xref was clearly incorrect, because GS thinks there are no pages.
(4) Not sure what the question is here; GS won't output a one-line summary. You could set -dPDFSTOPONERROR, which will exit with an error code if there's a problem with the PDF file. It'll be a full PostScript error message though, not a single line.
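
Since GS won't produce the one-line summary itself, a wrapper script can build one from its output and exit code. The following is a rough Python sketch, not a Ghostscript feature. The function names, the status wording, and the choice of marker strings (copied from the Results.txt in the question) are all assumptions. It also assumes `gswin64c.exe` is on the PATH and uses the `nullpage` device so that nothing is written to disk. Because Ghostscript isn't a validator, the best status it can give is "no problems reported", not "valid".

```python
import csv
import hashlib
import io
import subprocess

# Marker strings copied from the Results.txt shown in the question (an assumption
# that these are the lines worth flagging; other messages may exist).
PROBLEM_MARKERS = ("**** Error", "**** This file had errors")


def classify(gs_output, returncode):
    """Flag a file if Ghostscript exited non-zero or printed a known marker."""
    if returncode != 0 or any(marker in gs_output for marker in PROBLEM_MARKERS):
        return "problems reported"
    return "no problems reported"


def md5_hex(data):
    """MD5 hex digest of the raw file bytes."""
    return hashlib.md5(data).hexdigest()


def summary_line(filename, md5, status):
    """One CSV row (no trailing newline), quoted where the csv module requires it."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="").writerow([filename, md5, status])
    return buffer.getvalue()


def check_pdf(path, gs_exe="gswin64c.exe"):
    """Run Ghostscript on one PDF and return its one-line summary."""
    with open(path, "rb") as handle:
        digest = md5_hex(handle.read())
    result = subprocess.run(
        [gs_exe, "-dBATCH", "-dNOPAUSE", "-sDEVICE=nullpage",
         "-dPDFSTOPONERROR", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep messages from both streams together
        text=True,
        errors="replace",
    )
    return summary_line(path, digest, classify(result.stdout, result.returncode))
```

Calling `check_pdf(r"C:\PDFs\Badfile.pdf")` for each file and writing the returned lines to a .csv file gives something a spreadsheet can open next to JHOVE's results.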
