
I am trying to populate repeated forms with PDFBox. I am using a TreeMap and populating the forms with individual records. The PDF form holds six records on page one, followed by a static page two; for a TreeMap with more than six records, the process repeats. The error I'm getting depends on the size of the TreeMap, and that is my problem: I can't figure out why I get this warning when I populate the TreeMap with more than 35 entries:

Apr 23, 2018 2:36:25 AM org.apache.pdfbox.cos.COSDocument finalize
WARNING: Warning: You did not close a PDF Document

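import java.awt.Desktop;
import java.io.File;
import java.io.IOException;
import java.util.Scanner;
import java.util.TreeMap;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

public class test {
    public static void main(String[] args) throws IOException {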
    File dataFile = new File("dataFile.csv");
    File fi = new File("form.pdf");
    Scanner fileScanner = new Scanner(dataFile);
    fileScanner.nextLine();
    TreeMap<String, String[]> assetTable = new TreeMap<String, String[]>();
    int x = 0;
    while (x <= 36) {
        String lineIn = fileScanner.nextLine();
        String[] elements = lineIn.split(",");
        elements[0] = elements[0].toUpperCase().replaceAll(" ", "");
        String key = elements[0];
        key = key.replaceAll(" ", "");
        assetTable.put(key, elements);
        x++;
    }
    PDDocument newDoc = new PDDocument();
    int control = 1;
    PDDocument doc = PDDocument.load(fi);
    PDDocumentCatalog cat = doc.getDocumentCatalog();
    PDAcroForm form = cat.getAcroForm();
    for (String s : assetTable.keySet()) {
        if (control <= 6) {
            PDField IDno1 = (form.getField("IDno" + control));
            PDField Locno1 = (form.getField("locNo" + control));
            PDField serno1 = (form.getField("serNo" + control));
            PDField typeno1 = (form.getField("typeNo" + control));
            PDField maintno1 = (form.getField("maintNo" + control));
            String IDnoOne = assetTable.get(s)[1];
            //System.out.println(IDnoOne);
            IDno1.setValue(assetTable.get(s)[0]);
            IDno1.setReadOnly(true);
            Locno1.setValue(assetTable.get(s)[1]);
            Locno1.setReadOnly(true);
            serno1.setValue(assetTable.get(s)[2]);
            serno1.setReadOnly(true);
            typeno1.setValue(assetTable.get(s)[3]);
            typeno1.setReadOnly(true);
            String type = "";
            if (assetTable.get(s)[5].equals("1"))
                type += "Hydrotest";
            if (assetTable.get(s)[5].equals("6"))
                type += "6 Year Maintenance";
            String maint = assetTable.get(s)[4] + " - " + type;
            maintno1.setValue(maint);
            maintno1.setReadOnly(true);
            control++;
        } else {
            PDField dateIn = form.getField("dateIn");
            dateIn.setValue("1/2019 Yearlies");
            dateIn.setReadOnly(true);
            PDField tagDate = form.getField("tagDate");
            tagDate.setValue("2019 / 2020");
            tagDate.setReadOnly(true);
            newDoc.addPage(doc.getPage(0));
            newDoc.addPage(doc.getPage(1));
            control = 1;
            doc = PDDocument.load(fi);
            cat = doc.getDocumentCatalog();
            form = cat.getAcroForm();
        }
    }
    PDField dateIn = form.getField("dateIn");
    dateIn.setValue("1/2019 Yearlies");
    dateIn.setReadOnly(true);
    PDField tagDate = form.getField("tagDate");
    tagDate.setValue("2019 / 2020");
    tagDate.setReadOnly(true);
    newDoc.addPage(doc.getPage(0));
    newDoc.addPage(doc.getPage(1));
    newDoc.save("PDFtest.pdf");
    Desktop.getDesktop().open(new File("PDFtest.pdf"));

    }
}

I can't figure out for the life of me what I'm doing wrong. This is the first week I've been working with PDFBox, so I'm hoping it's something simple.

Updated Error Message

WARNING: Warning: You did not close a PDF Document
Exception in thread "main" java.io.IOException: COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?
    at org.apache.pdfbox.cos.COSStream.checkClosed(COSStream.java:77)
    at org.apache.pdfbox.cos.COSStream.createRawInputStream(COSStream.java:125)
    at org.apache.pdfbox.pdfwriter.COSWriter.visitFromStream(COSWriter.java:1200)
    at org.apache.pdfbox.cos.COSStream.accept(COSStream.java:383)
    at org.apache.pdfbox.cos.COSObject.accept(COSObject.java:158)
    at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObject(COSWriter.java:522)
    at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObjects(COSWriter.java:460)
    at org.apache.pdfbox.pdfwriter.COSWriter.doWriteBody(COSWriter.java:444)
    at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1096)
    at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:419)
    at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1367)
    at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1254)
    at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1232)
    at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1204)
    at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1192)
    at test.test.main(test.java:87)
Ken Hall
  • It should also be noted that the x value in the while loop is what I'm using to restrict the size of the TreeMap. – Ken Hall Apr 23 '18 at 03:03
  • You pass resources from one document to another and then "lose" the reference to the source document (in the second `doc = PDDocument.load(fi);`), so it gets closed by gc. Don't do that. Make sure your document stays open while you use its resources, and close it explicitly. – Tilman Hausherr Apr 23 '18 at 11:53

2 Answers


The warning by itself

You appear to misread the warning. It says:

Warning: You did not close a PDF Document

So in contrast to what you think, "PDFbox saying PDDocument closed when its not", PDFBox says that you did not close a document!

After your edit one sees that it actually says that a COSStream has been closed and that a possible cause is that the enclosing PDDocument already has been closed. This is a mere possibility!

The warning in your case

That being said, by adding pages from one document to another you probably end up with both documents referencing those pages. In that case, when both documents are closed (e.g. automatically via garbage collection), the second one to close may indeed stumble across some COSStream instances that are already closed.

So my first piece of advice, to simply close the documents at the end with

doc.close();
newDoc.close();

probably won't remove the warnings, merely change their timing.

Actually you don't merely create two documents doc and newDoc, you even create new PDDocument instances and assign them to doc again and again, in the process setting the former document objects in that variable free for garbage collection. So you eventually have a big bunch of documents to be closed as soon as not referenced anymore.

I don't think it would be a good idea to close all those documents in doc early, in particular not before saving newDoc.

But if your code will eventually be run as part of a larger application instead of as a small, one-shot test application, you should collect all those PDDocument instances in some Collection and explicitly close them right after saving newDoc and then clear the collection.

Actually your exception looks like one of those lost PDDocument instances has already been closed by garbage collection, so you should collect the documents even in case of a simple one-shot utility to keep them from being GC disposed.

(@Tilman, please correct me if I'm wrong...)

Importing pages

To prevent problems with different documents sharing pages, you can try and import the pages to the target document and thereafter add the imported page to the target document page tree. I.e. replace

newDoc.addPage(doc.getPage(0));
newDoc.addPage(doc.getPage(1));

by

newDoc.addPage(newDoc.importPage(doc.getPage(0)));
newDoc.addPage(newDoc.importPage(doc.getPage(1)));

This should allow you to close each PDDocument instance in doc before losing it. There are certain drawbacks to this, though, cf. the method JavaDoc and this answer here.

An actual issue in your code

In your combined document you will have many fields with the same name (at least in case of a sufficiently high number of entries in your CSV file) which you initially set to different values. And you access the fields from the PDAcroForm of the respective original document but don't add them to the PDAcroForm of the combined result document.

This is asking for trouble! The PDF format does consider forms to be document-wide with all fields referenced (directly or indirectly) from the AcroForm dictionary of the document, and it expects fields with the same name to effectively be different visualizations of the same field and therefore to all have the same value.

Thus, PDF processors might handle your document fields in unexpected ways, e.g.

  • by showing the same value in all fields with the same name (as they are expected to have the same value) or
  • by ignoring your fields (as they are not in the document AcroForm structure).

In particular, programmatic reading of your PDF field values will fail, because in that context the form is definitely considered document-wide and based on the AcroForm. PDF viewers, on the other hand, might at first show the values you set and make things look ok.

To prevent this you should rename the fields before merging. You might consider using the PDFMergerUtility which does such a renaming under the hood. For an example usage of that utility class have a look at the PDFMergerExample.
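
As a minimal sketch of what such a renaming scheme could look like, the plain-Java helper below derives a distinct name per copy of the form. The class name `FieldRenamer`, its methods, and the `_c<n>` suffix are hypothetical choices for illustration, not part of PDFBox and not necessarily how PDFMergerUtility names fields; the sketch only computes the names and does not touch a PDF.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: derives per-copy field names so merged forms do not share names. */
final class FieldRenamer {

    /** Appends the copy index to a field name, e.g. "IDno1" for copy 3 becomes "IDno1_c3". */
    static String uniqueName(String fieldName, int copyIndex) {
        return fieldName + "_c" + copyIndex;
    }

    /** Maps every original field name to its renamed counterpart for one copy of the form. */
    static Map<String, String> renameAll(List<String> fieldNames, int copyIndex) {
        Map<String, String> renamed = new LinkedHashMap<>();
        for (String name : fieldNames) {
            renamed.put(name, uniqueName(name, copyIndex));
        }
        return renamed;
    }

    public static void main(String[] args) {
        // Field names taken from the question's form; two copies get two distinct name sets.
        List<String> names = Arrays.asList("IDno1", "locNo1", "dateIn");
        System.out.println(renameAll(names, 1));
        System.out.println(renameAll(names, 2));
    }
}
```

With PDFBox 2.x the new names would presumably be applied to each freshly loaded form through `PDField.setPartialName` before its pages are added to the result; treat that step as an assumption to check against the PDFBox version in use. The renamed fields would still have to be added to the AcroForm of the combined document, as described above.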

mkl
  • Sorry, I should have copied more of the error message. It goes on to say that the PDDocument is already closed. I'll copy and paste the exact verbiage when I get back in front of that computer. – Ken Hall Apr 23 '18 at 11:37
  • *"It goes on to say that the PDDocument is already closed."* - No, it says that a `COSStream` has been closed. A **possible** cause is that the enclosing `PDDocument` already has been closed. This is a mere conjecture! That being said, by adding pages from one document to another you probably end up having references to those pages from both documents. In that case, in the course of closing both documents, the second one closing may indeed stumble across some already closed `COSStream` instances. – mkl Apr 23 '18 at 13:02
  • Thank you for all the detail! It's much appreciated. Forgive me if I'm changing the subject too much, but... Would it more advantageous,in your opinion, for me to convert the PDPage into an image and then add it? That way I'd be able to eliminate the field name issue entirely. Once this report is generated, the intent is to print it and be done; it will never need to be edited again. And if so, can you provide some direction as to where I can read up on that? This is the first program I've ever written intended for other people's use, please forgive my naivety. – Ken Hall Apr 23 '18 at 21:11
  • My end solution was far from the most prudent method. Since it was apparent that the object was getting picked up by garbage collection too soon, I simply added the object to an array of PDDocuments for each instance of the object. I KNOW THIS IS NOT GOOD PRACTICE! I also looked at the potential memory usage, and at most it uses up a few extra MB of memory. The solution works well in my case solely because of the intended use. Thank you for all the help! – Ken Hall Apr 24 '18 at 00:10
  • @Ken *"Would it more advantageous,in your opinion, for me to convert the PDPage into an image and then add it?"* - I generally think that a bad solution. Even if you generate the pdf merely for printing it, you'd have to implement that conversion to image to match your very printing method, and any change in the latter, e.g. an update of print drivers, might require adaptations in your program to still achieve optimum quality. – mkl Apr 24 '18 at 04:10
  • @Ken *"I simply added the object to an array of PDDocuments for each instance of the object. I KNOW THIS IS NOT GOOD PRACTICE!"* - it's not really bad. You need those objects to be not garbage collected, so you keep them referenced. I'd have used some collection class, but that may be a case of personal preference... – mkl Apr 24 '18 at 04:14

Even though the above answer was marked as the solution to the problem, since the solution is buried in the comments, I wanted to add this answer at this level. I spent several hours searching for the solution.

My code snippets and comments.

// Collection solely for purpose of preventing premature garbage collection
List<PDDocument> sourceDocuments = new ArrayList<>( );

...

// Source document (actually inside a loop)
PDDocument docIn = PDDocument.load( artifactBytes );

// Add document to collection before using it to prevent the problem
sourceDocuments.add( docIn );

// Extract from source document 
PDPage extractedPage = docIn.getPage( 0 );
// Add page to destination document
docOut.addPage( extractedPage );

...

// This was failing with "COSStream has been closed and cannot be read."
// Now it works.
docOut.save( bundleStream );
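
The snippet above keeps the source documents alive but never closes them. Following the first answer's advice to close the collected documents right after saving and then clear the collection, a minimal sketch could look like the helper below. The class name `SourceCloser` and the method `closeAll` are hypothetical; the helper works on `java.io.Closeable`, which `PDDocument` implements in PDFBox 2.x, so it does not depend on PDFBox itself.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;

/** Hypothetical helper for closing source documents once the target has been saved. */
final class SourceCloser {

    /**
     * Closes every resource in the list, keeps going if one close fails,
     * clears the list, and finally rethrows the first failure (if any).
     */
    static void closeAll(List<? extends Closeable> resources) throws IOException {
        IOException first = null;
        for (Closeable resource : resources) {
            try {
                resource.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        // Drop the references only after all resources have been closed.
        resources.clear();
        if (first != null) {
            throw first;
        }
    }
}
```

In the snippet above this would be called as `SourceCloser.closeAll( sourceDocuments );` directly after `docOut.save( bundleStream );`, and never before it, because saving still reads streams owned by the source documents.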
Bruce Wilcox