OpenPDF/iText corrupt documents

Question

I've been trying to re-implement the concatenate example from OpenPDF 1.2.4 and 1.2.11 in Scala:

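import java.io.ByteArrayOutputStream
import com.lowagie.text.Document
import com.lowagie.text.pdf.{PdfCopy, PdfReader}

// `log` and `getPageSize` are defined elsewhere in my code.
def mergePdfs(docs: Seq[Array[Byte]]): Array[Byte] = {
  log.debug(s"merging ${docs.size} PDFs")
  val output = new ByteArrayOutputStream()
  val document = new Document()
  val copy = new PdfCopy(document, output)
  // Use the page size of the first input, if there is one.
  getPageSize(docs.headOption) foreach document.setPageSize
  document.open()
  docs foreach { doc =>
    val reader = new PdfReader(doc)
    // PDF page numbers are 1-based.
    1 to reader.getNumberOfPages foreach { pageNum =>
      copy.addPage(copy.getImportedPage(reader, pageNum))
    }
  }
  document.close()
  output.toByteArray
}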

Here is an example output document. I generated it from two copies of this and then three copies of this.

I am seeing two issues:

~~- Document is corrupt (only opens in Firefox), partly due to a line of cruft immediately between the header and the first object. Deleting the offending line does not fix the document.~~ Error in client code, thanks @mkl!

Some pages (usually one, but it's non-deterministic) appear blank. I have not found any pattern in which pages are affected. Additionally, each page's text appears twice in the file, e.g. in the example above:

$ strings out.pdf | grep "A Simple PDF File" | wc -l | tr -d ' '
6

In one case I used vim to delete the first content stream and that caused the text to appear on the first page.

Am I misusing the API in some way?

At first glance the code looks ok (even though I don't understand what `pageSize.foreach(document.setPageSize)` might do meaningfully). Could it be that you outside this method treat the byte array returned here as text? Please share an example problem PDF for analysis. — mkl, Mar 22 '19 at 09:44
Happy to clarify! pageSize is a Maybe monad (called `Option` in Scala), which is a collection of zero or one elements. So it gets the page size of the first source document if there is one, and then, if there is one, it sets the page size of the target document. PDFs coming. — ILikeFood, Mar 22 '19 at 17:22
Added input and output examples. Thanks for checking on whether I am handling the input and output properly! In my case all I am doing is calling Files.readAllBytes on each path from the CLI, and then calling Files.write with the resulting byte array. Not a lot of room for that kind of error to sneak in. — ILikeFood, Mar 22 '19 at 18:05
Is this a question about iText or about OpenPDF? It can't be a question about both... — Amedee Van Gasse, Mar 25 '19 at 13:27
@ILikeFood Your new *example output document* link returns the same file as the old one. — mkl, Mar 26 '19 at 09:15
That new file still has the same size as the old, defective file: 31181 bytes. This time the first 19593 bytes contain the actual file. Are you sure you are using `TRUNCATE_EXISTING` in `Files.write`? Furthermore, even the PDF consisting of the first 19593 bytes does not look like it was created by your code above; it looks like it was created using a regular `PdfWriter`, not a `PdfCopy`. — mkl, Mar 27 '19 at 11:13
I returned to my MWE and it is now producing example files of length 17,461 that look correct. I can only assume the other example is caused by miscopying something as well. I've accepted your original answer. Thank you very much for your help, and I owe you a beer if you ever find yourself in Boston. — ILikeFood, Mar 29 '19 at 16:33
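
The `Option`-based page-size call discussed in the comments above can be illustrated with a minimal sketch. `Target` and `applySize` are hypothetical names made up for this illustration and are not part of OpenPDF; the only point is that `foreach` on an `Option` runs its function once for `Some` and not at all for `None`.

```scala
object OptionForeachDemo {
  // Hypothetical stand-in for a document whose page size can be set.
  final class Target { var pageSize: String = "default" }

  def applySize(size: Option[String]): String = {
    val target = new Target
    // foreach runs the function once for Some(value) and never for None.
    size.foreach(s => target.pageSize = s)
    target.pageSize
  }

  def main(args: Array[String]): Unit = {
    println(applySize(Some("A4"))) // A4
    println(applySize(None))       // default
  }
}
```

So with an empty input list the target document simply keeps its default page size.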

score 1 · Accepted Answer · answered Mar 22 '19 at 21:17

The first 17465 bytes of your result file are the actual result of your code ("two copies of this and then three copies of this"). The remaining bytes of the 31181-byte file consist of fragments of other PDFs.

In a comment you say you're "calling Files.write with the resulting byte array." Which OpenOptions are you using? Probably CREATE but not TRUNCATE_EXISTING?
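
A minimal sketch of the suspected effect, assuming an older, longer file already exists at the target path. The temp file and the byte counts are made up for illustration. Note that `Files.write` called with no options at all behaves as if `CREATE`, `TRUNCATE_EXISTING`, and `WRITE` were given, so stale trailing bytes would only be expected when options are passed explicitly without `TRUNCATE_EXISTING`.

```scala
import java.nio.file.{Files, StandardOpenOption}

object TruncateDemo {
  // Returns the file size after a write without and then with TRUNCATE_EXISTING.
  def sizesAfterWrites(): (Long, Long) = {
    // Hypothetical scratch file standing in for out.pdf.
    val path = Files.createTempFile("out", ".pdf")
    try {
      // An earlier, longer result already sits at the target path.
      Files.write(path, Array.fill[Byte](100)('x'.toByte))

      // CREATE + WRITE overwrites from offset 0 but keeps the old tail.
      Files.write(path, Array.fill[Byte](40)('y'.toByte),
        StandardOpenOption.CREATE, StandardOpenOption.WRITE)
      val withoutTruncate = Files.size(path) // 40 new bytes + 60 stale bytes

      // TRUNCATE_EXISTING empties the file before the new bytes are written.
      Files.write(path, Array.fill[Byte](40)('y'.toByte),
        StandardOpenOption.CREATE, StandardOpenOption.WRITE,
        StandardOpenOption.TRUNCATE_EXISTING)
      val withTruncate = Files.size(path)

      (withoutTruncate, withTruncate)
    } finally Files.deleteIfExists(path)
  }

  def main(args: Array[String]): Unit =
    println(sizesAfterWrites()) // (100,40)
}
```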

mkl

Classic. With `TRUNCATE_EXISTING` the file corruption issue is solved but I still get blank pages -- consistent with what we've been seeing when we execute this through the production code path (question updated). Looking forward to a nice long debugging date with just iText and me. – ILikeFood Mar 22 '19 at 22:00
