Merging PDFs in memory and saving in PDF/A format with Ghostscript

Question

I am trying to merge two PDFs in memory with PyPDF2 and save the result in PDF/A format. Converting a PDF to PDF/A with Ghostscript works, but only when the source and target are both paths on disk. It does not work with the merged PDF held in memory.

The following code produces the error:

import os, subprocess
from PyPDF2 import PdfFileMerger
from io import BytesIO

def convertPDF2PDFA(sourceFile, targetFile):
    ghostScriptExec = ['gs', '-dPDFA=2', '-dBATCH', '-dNOPAUSE', '-sProcessColorModel=DeviceCMYK',
                       '-sDEVICE=pdfwrite', '-dPDFACompatibilityPolicy=3']
    cwd = os.getcwd()
    os.chdir(os.path.dirname(targetFile))
    try:
        subprocess.check_output(ghostScriptExec +
                                ['-sOutputFile=' + os.path.basename(targetFile), sourceFile])
    except subprocess.CalledProcessError as e:
        raise RuntimeError("command '{}' return with error (code {}): {}".format(e.cmd, e.returncode, e.output))
    os.chdir(cwd)

path1 = 'path1'
path2 = 'path2'
save_path = 'result_path'

paths = [path1, path2]
merger = PdfFileMerger()
tmp = BytesIO()

for path in paths:
    merger.append(path, import_bookmarks=False)

merger.write(tmp)

convertPDF2PDFA(tmp.getvalue(), save_path)

The error I get is ValueError: embedded null byte.
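
The traceback comes from Python rather than from Ghostscript: `tmp.getvalue()` returns the raw bytes of the merged PDF, and `subprocess` then tries to pass those bytes as a command-line argument, which cannot contain a null byte. A minimal reproduction, assuming a POSIX system (`fake_pdf` and `try_as_argument` are made-up names for illustration, and the bytes are only a stand-in for real PDF data):

```python
import subprocess
import sys

# Hypothetical stand-in for merged PDF bytes; real PDF data usually contains null bytes.
fake_pdf = b"%PDF-1.7\x00binary data"


def try_as_argument(data):
    """Return the name of the exception raised when data is used as an argv entry, else None."""
    try:
        # Same mistake as in the question: file *contents* where a file *path* is expected.
        subprocess.run([sys.executable, "-c", "pass", data], check=True)
    except ValueError as exc:
        return type(exc).__name__
    return None


print(try_as_argument(fake_pdf))
```

The child process is never started in this case, so no Ghostscript setting can change the outcome.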

Edit: I changed the Ghostscript parameters to:

ghostScriptExec = ['gs', '-dPDFA=2', '-dBATCH', '-dNOPAUSE', '-dNOSAFER', '-sProcessColorModel=DeviceRGB',
                   '-sDEVICE=pdfwrite', '-dPDFACompatibilityPolicy=2']

I also added PDFA_def.ps, in which I included the AdobeRGB1998.icc profile, and pass it to Ghostscript before the source file:

def convertPDF2PDFA(sourceFile, targetFile):
    cwd = os.getcwd()
    os.chdir(os.path.dirname(targetFile))
    pdfa_def_path = '/Users/mazze/Desktop/PDFA_def.ps'
    try:
        subprocess.check_output(ghostScriptExec +
                                ['-sOutputFile=' + os.path.basename(targetFile), pdfa_def_path, sourceFile])
    except subprocess.CalledProcessError as e:
        raise RuntimeError("command '{}' return with error (code {}): {}".format(e.cmd, e.returncode, e.output))
    os.chdir(cwd)

Converting the PDF to PDF/A now works, or at least Adobe always confirms the PDF/A format. However, it only works if I first save the merged file to disk, pass that path to convertPDF2PDFA, and save the result under a new path. Do I always have to take this extra step?
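
If the intermediate file cannot be avoided, a temporary file at least keeps it out of the way. The following is a minimal sketch, not a definitive solution: `bytes_as_temp_pdf` is a hypothetical helper name, and it assumes that `convertPDF2PDFA` keeps expecting a path as its first argument.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def bytes_as_temp_pdf(data):
    """Write data to a temporary .pdf file, yield its absolute path, then delete the file."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Runs even if the conversion inside the with-block raises.
        os.remove(path)


# Usage with the names from the question (not executed here):
# with bytes_as_temp_pdf(tmp.getvalue()) as merged_path:
#     convertPDF2PDFA(merged_path, save_path)
```

`tempfile.mkstemp` returns an absolute path, so the `os.chdir` call inside `convertPDF2PDFA` does not affect where the source file is found.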

That isn't a Ghostscript error. Ghostscript (more properly Ghostscript's pdfwrite device) doesn't merge files, it creates a new file which should be visually the same. The actual PDF operations will be quite different. Ghostscript only works with files on disk because you must be able to seek in a PDF file, and there's no simple cross-platform way to hand a buffer of memory to a separate process. Your PDF/A creation isn't adequate since you aren't running pdfa_def.ps (or a suitably modified version thereof) — KenS, May 04 '22 at 18:19
Does this mean that I have to save the merged PDF first, then read it back in with Ghostscript and convert it to PDF/a? When I run the function `convertPDF2PDFA` with two files from disk, I get a PDF in PDF/a format — Mazze, May 04 '22 at 20:22
I tried this command line `gs -dPDFA=2 -dBATCH -dNOPAUSE -sProcessColorModel=DeviceRGB -sDEVICE=pdfwrite -o /Users/mazze/Desktop/test1.pdf /Users/mazze/Desktop/PDFA_def.ps -dPDFACompatibilityPolicy=1 /Users/mazze/Desktop/1.pdf`. I got the `PDFA_def.ps` from this [post](https://stackoverflow.com/questions/1659147/how-to-use-ghostscript-to-convert-pdf-to-pdf-a-or-pdf-x). In the `PDFA_def.ps` I included the `AdobeRGB1998.icc`. But I get the error `Setting Overprint Mode to 1 not permitted in PDF/A-2, reverting to normal PDF output` — Mazze, May 05 '22 at 11:09
OK and ? The error is telling you that the graphics operation (overprint) isn't valid in a PDF/A-2 file. You should have got a message that the mode was ignored. Try moving -dPDFACompatibilityPolicy=1 **before** pdfa_def.ps so that it is set before you start interpreting programs. Also check the spelling. — KenS, May 06 '22 at 07:51
No, converting the PDF works (or at least that's what Adobe tells me). Nevertheless, I still have to save the merged PDF from PyPDF2 to disk before passing it to Ghostscript. @KenS, from your first comment I am not sure whether this can work at all — Mazze, May 12 '22 at 18:21
Ghostscript can create a new PDF from two source PDF files (provided they are both on disk). That output PDF file can further be made to be PDF/A compliant. I cannot tell you what you're doing wrong without an example file (or files, including pdfa_def.ps) and a command line. You cannot give Ghostscript a memory buffer, but you can send the input via stdin (and Ghostscript will write it to a temporary file, so no actual saving). The output **can** be sent to stdout, but I don't recommend it as some features require a file (for seeking). — KenS, May 13 '22 at 07:11
Oh, and I wouldn't trust Adobe Acrobat's PDF/A verification. Try VeraPDF to check if files are valid. — KenS, May 13 '22 at 07:12
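
The stdin route mentioned in the comments could be sketched as follows. This is a rough sketch under several assumptions: `build_gs_command` and `convert_bytes_to_pdfa` are hypothetical names, the flags are copied from the question's edit, and a lone `-` as the last argument is assumed to make Ghostscript read its input from stdin. The output still goes to a file, as the comments recommend.

```python
import subprocess


def build_gs_command(target_file, pdfa_def_path):
    """Build a Ghostscript argument list that reads the source PDF from stdin."""
    return [
        "gs", "-dPDFA=2", "-dBATCH", "-dNOPAUSE", "-dNOSAFER",
        "-sProcessColorModel=DeviceRGB", "-sDEVICE=pdfwrite",
        # Set before the PostScript definition file is interpreted.
        "-dPDFACompatibilityPolicy=2",
        "-sOutputFile=" + target_file,
        pdfa_def_path,
        "-",  # assumption: a lone '-' tells Ghostscript to read from stdin
    ]


def convert_bytes_to_pdfa(pdf_bytes, target_file, pdfa_def_path):
    """Pipe in-memory PDF bytes to Ghostscript and return its console output."""
    result = subprocess.run(
        build_gs_command(target_file, pdfa_def_path),
        input=pdf_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    if result.returncode != 0:
        raise RuntimeError(
            "Ghostscript failed (code {}): {}".format(result.returncode, result.stdout)
        )
    return result.stdout


# Usage with the names from the question (not executed here):
# convert_bytes_to_pdfa(tmp.getvalue(), save_path, "/Users/mazze/Desktop/PDFA_def.ps")
```

Here the PDF bytes travel through the pipe instead of the argument list, so the null bytes in the data are not a problem for `subprocess`. Whether the result is valid PDF/A still needs to be checked, for example with VeraPDF.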
