Two scholars said they used Ghostscript to validate PDFs. Their cryptic explanation of the technique: "To make Ghostscript work as a validator, we simply converted the PDF files to 'None'." In a slideshow, they added that "None" was "a dummy result, no real output," and that converting to None "prints out found errors."
I would like to use Ghostscript in a similar manner, and would also like to learn a little about Ghostscript in the process, for future applications. My review of the Ghostscript documentation and of a previous StackOverflow answer has led me to try this (using Ghostscript Portable 9.50 in a Windows 7 virtual machine):
gswin64c.exe -o /dev/null -dNODISPLAY "C:\PDFs\Badfile.pdf" > "C:\Results.txt"
I welcome suggestions on whether that is the best command for the purpose. My questions here have to do with what Results.txt says about Badfile.pdf. Here are the contents of Results.txt:
GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
**** Error: An error occurred while reading an XREF table.
**** The file has been damaged. This may have been caused
**** by a problem while converting or transfering the file.
**** Ghostscript will attempt to recover the data.
**** However, the output may be incorrect.
No pages will be processed (FirstPage > LastPage).
**** This file had errors that were repaired or ignored.
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
**** The rendered output from this file may be incorrect.
My questions:
(1) Should I interpret this output as saying that the XREF table problem is the only problem in this file, or may there be other unspecified problems? If the latter, can I modify the command so that Results.txt indicates more specifically in what way Badfile.pdf "does not conform to Adobe's published PDF specification"?
(2) "The file has been damaged. This may have been caused by a problem while converting or transfering the file." Is this suggesting that, for some flagged PDFs, the problems identified by Ghostscript may be due to Ghostscript itself?
(3) "Ghostscript will attempt to recover the data. ... This file had errors that were repaired or ignored." Can I assume that the operative word is "ignored" -- that, as in the procedure used by those two scholars, Ghostscript is not really attempting to recover data, and my command is producing "no real output"?
(4) For some purposes, I may want output in a one-line summary form. For instance, the JHOVE PDF validator's audit option can produce a line containing filename, MD5 hash, and a statement of whether the PDF file is valid. Given the scholars' finding that JHOVE has problems, it would be helpful if I could put Ghostscript's findings into a spreadsheet for comparison. Can Ghostscript produce a summary like that, or would I have to build it from Results.txt myself?
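To make question (4) concrete, here is a rough Python sketch of the kind of one-line summary I have in mind. It is only an illustration under several assumptions of mine: that gswin64c.exe is on the PATH, that the nullpage device (rather than my -o /dev/null attempt above) is a reasonable way to get "no real output," and that any output line beginning with "****" means Ghostscript found a problem. That last rule is a guess based on the Results.txt shown above, not documented Ghostscript behavior, and the function names are hypothetical.

```python
import csv
import hashlib
import subprocess
import sys
from pathlib import Path

# Assumption: the Ghostscript console executable is on PATH under this name.
GS_EXE = "gswin64c.exe"


def build_command(pdf_path):
    """Build a command that interprets the PDF but renders to the nullpage device."""
    return [GS_EXE, "-dBATCH", "-dNOPAUSE", "-sDEVICE=nullpage", str(pdf_path)]


def has_problems(gs_output):
    """Heuristic (my assumption): diagnostic lines start with '****'."""
    return any(line.lstrip().startswith("****") for line in gs_output.splitlines())


def summary_row(pdf_path, pdf_bytes, gs_output):
    """Return [filename, MD5 hash, status], similar to a JHOVE audit line."""
    md5 = hashlib.md5(pdf_bytes).hexdigest()
    status = "problems reported" if has_problems(gs_output) else "no problems reported"
    return [str(pdf_path), md5, status]


def check_pdf(pdf_path):
    """Run Ghostscript on one file and summarize whatever it printed."""
    pdf_path = Path(pdf_path)
    result = subprocess.run(build_command(pdf_path), capture_output=True, text=True)
    # Combine both streams, since I am not sure which one carries the messages.
    return summary_row(pdf_path, pdf_path.read_bytes(), result.stdout + result.stderr)


if __name__ == "__main__":
    # Usage: python gs_audit.py file1.pdf file2.pdf > audit.csv
    writer = csv.writer(sys.stdout)
    writer.writerow(["filename", "md5", "ghostscript_status"])
    for name in sys.argv[1:]:
        writer.writerow(check_pdf(name))
```

Redirected to a .csv file, this would give one spreadsheet row per PDF. It obviously cannot say which problems were found, only whether Ghostscript printed any "****" lines, so it depends on the answers to questions (1) through (3).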
I realize Ghostscript may not have all this, and I appreciate what I already have from it. But if I am missing anything, I'd like to know. Thank you for any light you can shed.