I have successfully implemented a PDF merge solution with PDFBox using InputStreams. However, when I try to merge a very large document I receive the following error:

Caused by: java.io.IOException: Missing root object specification in trailer.
at org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2832) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:173) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1144) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1060) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.multipdf.PDFMergerUtility.legacyMergeDocuments(PDFMergerUtility.java:379) ~[pdfbox-2.0.11.jar:2.0.11]
at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:280) ~[pdfbox-2.0.11.jar:2.0.11]

Of more importance (I think) are these statements that occur just before the error:

FINE (pdfparser.COSParser) [] - Missing end of file marker '%%EOF'
FINE (pdfparser.COSParser) [] - Set missing offset 388 for object 2 0 R

It seems to me that it can't find the '%%EOF' marker in very large files. Now I know that it is indeed there because I can look at the source (unfortunately I can't provide the file itself).

Doing some searching online I found that there is a setEOFLookupRange() method on the COSParser class. I'm wondering if perhaps the lookup range is too small and that is why it can't find the '%%EOF' marker. The problem is...I'm not using the COSParser object at all in my code; I'm only using the PDFMergerUtility class. The PDFMergerUtility seems to be using the COSParser under the hood.

So my questions are

  1. Is my hypothesis about the EOFLookupRange correct?
  2. If so, how can I set that range only having the PDFMergerUtility in my code and not the COSParser object?

Many thanks for your time!

UPDATED with code below

 private boolean getCoolDocuments(final String slateId, final String filePathAndName)
            throws IOException {

        boolean status = false;
        InputStream pdfStream = null;
        HttpURLConnection connection = null;
        final PDFMergerUtility merger = new PDFMergerUtility();
        final ByteArrayOutputStream mergedPdfOutputStream = new ByteArrayOutputStream();

        try {

            final List<SlateDocument> parsedSlateDocuments = this.getSpecificDocumentsFromSlate(slateId);

            if (!parsedSlateDocuments.isEmpty()) {

                // iterate through each document, adding each pdf stream to the merger utility
                int numberOfDocuments = 0;
                for (final SlateDocument slateDocument : parsedSlateDocuments) {

                    final String url = this.getBaseURL() + "/slate/" + slateId + "/documents/"
                            + slateDocument.getDocumentId();

                     /* code for RequestResponseUtil.initializeRequest(...) below */
                    connection = RequestResponseUtil.initializeRequest(url, "GET", this.getAuthenticationHeader(),
                            true, MediaType.APPLICATION_PDF_VALUE);

                    if (RequestResponseUtil.isSuccessful(connection.getResponseCode())) {
                        pdfStream = connection.getInputStream();

                    }
                    else {
                        /* do various things */
                    }

                    merger.addSource(pdfStream);
                    numberOfDocuments++;
                }

                merger.setDestinationStream(mergedPdfOutputStream);

                // merge all the pdf streams together
                merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());

                status = true;
            }
            else {
                LOG.severe("An error occurred while parsing the slated documents; no documents remain after parsing!");
            }
        }
        finally {
            RequestResponseUtil.close(pdfStream);

            this.disconnect(connection);
        }

        return status;
    }

   public static HttpURLConnection initializeRequest(final String url, final String method,
            final String httpAuthHeader, final boolean multiPartFormData, final String reponseType) {

    HttpURLConnection conn = null;

    try {
        conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod(method);
        conn.setRequestProperty("X-Slater-Authentication", httpAuthHeader);
        conn.setRequestProperty("Accept", reponseType);
        if (multiPartFormData) {
            conn.setRequestProperty("Content-Type", "multipart/form-data; boundary=BOUNDARY");
            conn.setDoOutput(true);
        }
        else {
            conn.setRequestProperty("Content-Type", "application/xml");
        }
    }
    catch (final MalformedURLException e) {
        throw new CustomException(e);
    }
    catch (final IOException e) {
        throw new CustomException(e);
    }
    return conn;

}
risingTide
  • The "Missing root object specification in trailer" usually happens when the file is truncated. One %%EOF should be at the end of the file (but there can be more in the middle). If there isn't an %%EOF at the end of the file, then you should find out why. – Tilman Hausherr Jul 28 '18 at 04:11
  • Even stricter: it's not merely *recommended* to have an %%EOF at the end of pdf files, it is *required*! Any pdf file without that is broken. Thus, if you have to change the EOFLookupRange to load the pdf, you might be in for some surprise concerning the contents of the pdf. – mkl Jul 28 '18 at 15:10
  • @TilmanHausherr & @mkl - There are actually 3 `%%EOF` markers in the large file in question. One of them is at the very end and two are in the middle of the document. However, I have other smaller files that also have 3 `%%EOF` in the same positions that merge just fine. – risingTide Jul 30 '18 at 12:37
  • Could it be that there are null bytes that you didn't see? If you haven't, try looking at the file with Notepad++. – Tilman Hausherr Jul 30 '18 at 12:40
  • Null bytes? What do you mean by that? – risingTide Jul 30 '18 at 12:41
  • @TilmanHausherr - If you're referring to `0x00` then no, I do not have any of those in my file; I did check with NotePad++. – risingTide Jul 30 '18 at 17:39
  • Now comes the time where you either share the file, or debug this with the source code around that "Missing end of file marker" message. That is in `COSParser.getStartxrefOffset()`. – Tilman Hausherr Jul 30 '18 at 18:06
  • I really wish I could share it, but I can't. However, I took another approach and may be on to something. As posted above I'm working with the `InputStream` solution (`PDFMergerUtility.addSource(InputStream)`). I decided to test the same large file with the File solution instead (`PDFMergerUtility.addSource(File)`). Well, it works if I write the stream to a file first and then merge it using the File solution. So perhaps for the larger files the `InputStream` is being closed before the merge has time to complete, hence it can't find the `%%EOF` marker? – risingTide Jul 30 '18 at 18:38
  • @risingTide This is strange. Internally `PDFMergerUtility.addSource(File)` transforms a file to a `FileInputStream`. Other than that there isn't any difference. – Master_ex Jul 31 '18 at 07:38
  • @TilmanHausherr & Master_ex - I'm working on getting a large file that doesn't have proprietary information that I can share with you two. In the meantime I posted the relevant portion of my code in the post above. I'm still leaning toward the notion that the inputstream is being closed (or something) before the merge is complete since it only takes place on large files and reading the same large files directly from a folder works. – risingTide Jul 31 '18 at 20:14
  • @TilmanHausherr - Figured it out; I posted in an answer below. – risingTide Aug 01 '18 at 02:06

2 Answers

As I suspected, this was an issue with the InputStream. It wasn't exactly what I thought, but basically I was making the (very wrong) assumption that I could just do this:

           pdfStream = connection.getInputStream();
                /* ... */
           merger.addSource(pdfStream);

Of course, that's not going to work because the entire InputStream may or may not be read. It needs to be read explicitly until read() returns -1, which signals the end of the stream. I'm pretty sure that on the smaller files this was working fine and actually reading in the entire stream, but on the larger files it simply wasn't making it to the end...hence not finding the %%EOF marker.

The solution was to use an intermediary ByteArrayOutputStream and then convert that back to an InputStream via a ByteArrayInputStream.

So if you replace this line of code:

pdfStream = connection.getInputStream();

above with this code:

                final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();

                int c;
                while ((c = connection.getInputStream().read()) != -1) {
                    byteArrayOutputStream.write(c);
                }

                pdfStream = new ByteArrayInputStream(byteArrayOutputStream.toByteArray());

you'll end up with a working example.
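
Reading one byte at a time works, but it can be slow for large responses. Here is a minimal sketch of the same idea using a chunked buffer. The class name StreamCopy, the method names, and the 8 KiB buffer size are my own illustrative choices, not anything from PDFBox or from the code above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of PDFBox): drains a stream completely into memory.
final class StreamCopy {

    private StreamCopy() {
    }

    // Reads in chunks until read() returns -1, i.e. until the real end of the stream.
    static byte[] readFully(final InputStream in) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        final byte[] buffer = new byte[8192]; // assumed buffer size
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Wraps the fully read bytes in a new in-memory stream.
    static InputStream toMemoryStream(final InputStream in) throws IOException {
        return new ByteArrayInputStream(readFully(in));
    }
}
```

With that helper, the replacement line would be something like pdfStream = StreamCopy.toMemoryStream(connection.getInputStream()); note that this still holds the whole PDF in memory.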

I may end up changing this implementation to use pipes or circular buffers instead, but at least this is working for now.

While this wasn't necessarily a Java 101 mistake, it was more like a Java 102 mistake and is still shameful. :/ Hopefully it will help someone else.

Thanks to @Tilman Hausherr and @Master_ex for all their help!

risingTide
  • Nice! It seems strange to me that `PDFBox` internally does something [similar](https://github.com/apache/pdfbox/blob/2.0.11/pdfbox/src/main/java/org/apache/pdfbox/io/ScratchFile.java#L422). Maybe in your case the connection was timing out or something? Anyway, you got it working :-) – Master_ex Aug 01 '18 at 07:58
  • I agree with Master_ex. PDFBox also copies the stream into its own buffer. (Which is kind of inefficient if the input comes from a file, but that's another story) – Tilman Hausherr Aug 01 '18 at 10:30
  • Do you know the size of your PDF? See https://stackoverflow.com/questions/263013/java-urlconnection-how-can-i-find-out-the-size-of-a-web-file If yes, then you should be able to check whether you get all or not. – Tilman Hausherr Aug 01 '18 at 10:35

I took a look in the code and found out that the default EOFLookupRange in COSParser is 2048 bytes.

I think that your assumption is valid.
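
To make the idea of a lookup range concrete: only the last N bytes of the file are searched for the marker. The sketch below illustrates that idea on a plain byte array. It is not PDFBox's implementation, and the class and method names are made up for illustration. It could also serve as a quick check that a downloaded PDF really ends with %%EOF.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: this is not PDFBox code.
final class EofCheck {

    private static final byte[] MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    private EofCheck() {
    }

    // Returns true if "%%EOF" lies entirely within the last lookupRange bytes of pdf.
    static boolean hasEofMarker(final byte[] pdf, final int lookupRange) {
        final int start = Math.max(0, pdf.length - lookupRange);
        // Scan backwards from the last position where the marker could still fit.
        for (int i = pdf.length - MARKER.length; i >= start; i--) {
            if (matchesAt(pdf, i)) {
                return true;
            }
        }
        return false;
    }

    // Compares the marker against the bytes starting at offset.
    private static boolean matchesAt(final byte[] data, final int offset) {
        for (int j = 0; j < MARKER.length; j++) {
            if (data[offset + j] != MARKER[j]) {
                return false;
            }
        }
        return true;
    }
}
```

A truncated download fails this check no matter how large the range is, because the marker is simply not there.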

Looking at the PDFParser, which extends the COSParser and is the parser used internally by the PDFMergerUtility, I see that it is possible to set another EOFLookupRange by using a system property. The system property name is org.apache.pdfbox.pdfparser.nonSequentialPDFParser.eofLookupRange and it should be a valid integer.

Here is a question demonstrating how to set system properties.
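
As a rough sketch, the property could be set programmatically before any PDF is loaded. The property name is the one quoted above; the helper class and its method names are hypothetical. The same can be done on the command line with a -D argument. This can only help if the marker is present but farther from the end of the file than the lookup range.

```java
// Hypothetical helper: sets the lookup range property before PDFBox parses anything.
final class EofLookupRangeConfig {

    // Property name as quoted in this answer.
    static final String PROPERTY =
            "org.apache.pdfbox.pdfparser.nonSequentialPDFParser.eofLookupRange";

    private EofLookupRangeConfig() {
    }

    // System properties are strings, so the range is stored in its decimal form.
    static void setEofLookupRange(final int bytes) {
        System.setProperty(PROPERTY, Integer.toString(bytes));
    }

    // Reads the value back; returns fallback if the property is unset or not a valid integer.
    static int currentEofLookupRange(final int fallback) {
        return Integer.getInteger(PROPERTY, fallback);
    }
}
```

Calling EofLookupRangeConfig.setEofLookupRange(4096) before creating the PDFMergerUtility, then printing currentEofLookupRange(2048), is one way to confirm the value was actually stored.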

I haven't tested the above but I hope it will work :)

The links to the PDFBox code use the 2.0.11 version which is the one that you are using.

Master_ex
  • Wow. Thanks for your extremely detailed answer. I would like to mark it as accepted but unfortunately I haven't gotten it to work yet. I set the system property above like you stated (trying multiple values as large as `160000`) just before I created the `PDFMergerUtility` (making sure I printed it out after I set it to verify), but it didn't make a difference. Perhaps there is a certain place I need to set it? Or perhaps this isn't the solution? – risingTide Jul 30 '18 at 14:13
  • The OP wrote that the %%EOF is really at the end, so changing that system property wouldn't change anything. – Tilman Hausherr Jul 30 '18 at 18:07
  • @risingTide Hey, I was checking the code again and I think that the exceptions that you mentioned mean that the pdf is malformed in some way (i.e. missing xref table). Other than that I guess Tilman is right, you have to provide an MCVE for us to stop guessing :) – Master_ex Jul 31 '18 at 07:35
  • @Master_ex - Finally got it; check the answer below. :p – risingTide Aug 01 '18 at 02:07