
PDFBox content streams are created per page, but the fields come from the form, which comes from the catalog, which comes from the PDF document itself. So I'm not sure which fields are on which pages, and this causes text to be written to the wrong locations/pages.

I.e., I'm processing fields page by page, but I don't know which fields are on which page.

Is there a way to tell which field is on which page? Or, is there a way to get just the fields on the current page?

Thank you!

Mark

code snippet:

PDDocument pdfDoc = PDDocument.load(file);
PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();

// Get field names
List<PDField> fieldList = acroForm.getFields();
List<PDPage> pages = pdfDoc.getDocumentCatalog().getAllPages();
for (PDPage page : pages) {
  PDPageContentStream contentStream = new PDPageContentStream(pdfDoc, page, true, true, true);
  processFields(acroForm, fieldList, contentStream, page);
  contentStream.close();
}
Mark Waschkowski

4 Answers


The PDFbox content stream is done per page, but the fields come from the form which comes from the catalog, which comes from the pdf doc itself. So I'm not sure which fields are on which pages

The reason for this is that PDFs contain a global object structure defining the form. A form field in this structure may have 0, 1, or more visualizations on 0, 1, or more actual PDF pages. Furthermore, in case of only 1 visualization, a merge of field object and visualization object is allowed.

PDFBox 1.8.x

Unfortunately PDFBox in its PDAcroForm and PDField objects represents only this object structure and does not provide easy access to the associated pages. By accessing the underlying structures, though, you can build the connection.

The following code should make clear how to do that:

@SuppressWarnings("unchecked")
public void printFormFields(PDDocument pdfDoc) throws IOException {
    PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();

    List<PDPage> pages = docCatalog.getAllPages();
    Map<COSDictionary, Integer> pageNrByAnnotDict = new HashMap<COSDictionary, Integer>();
    for (int i = 0; i < pages.size(); i++) {
        PDPage page = pages.get(i);
        for (PDAnnotation annotation : page.getAnnotations())
            pageNrByAnnotDict.put(annotation.getDictionary(), i + 1);
    }

    PDAcroForm acroForm = docCatalog.getAcroForm();

    for (PDField field : (List<PDField>)acroForm.getFields()) {
        COSDictionary fieldDict = field.getDictionary();

        List<Integer> annotationPages = new ArrayList<Integer>();
        List<COSObjectable> kids = field.getKids();
        if (kids != null) {
            for (COSObjectable kid : kids) {
                COSBase kidObject = kid.getCOSObject();
                if (kidObject instanceof COSDictionary)
                    annotationPages.add(pageNrByAnnotDict.get(kidObject));
            }
        }

        Integer mergedPage = pageNrByAnnotDict.get(fieldDict);

        if (mergedPage == null)
            if (annotationPages.isEmpty())
                System.out.printf("i Field '%s' not referenced (invisible).\n", field.getFullyQualifiedName());
            else
                System.out.printf("a Field '%s' referenced by separate annotation on %s.\n", field.getFullyQualifiedName(), annotationPages);
        else
            if (annotationPages.isEmpty())
                System.out.printf("m Field '%s' referenced as merged on %s.\n", field.getFullyQualifiedName(), mergedPage);
            else
                System.out.printf("x Field '%s' referenced as merged on %s and by separate annotation on %s. (Not allowed!)\n", field.getFullyQualifiedName(), mergedPage, annotationPages);
    }
}
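
To get from this per-field output back to what the question asks for (the fields on a given page), the mapping can be inverted. The following is only a minimal sketch in plain Java, assuming the field names and page numbers have been collected into a map instead of being printed; `FieldsByPage` and `invert` are illustrative names, not PDFBox API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FieldsByPage {
    /** Inverts "field name -> pages it appears on" into "page -> field names on it". */
    static Map<Integer, List<String>> invert(Map<String, List<Integer>> pagesByField) {
        Map<Integer, List<String>> fieldsByPage = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : pagesByField.entrySet()) {
            for (Integer page : entry.getValue()) {
                if (page == null) {
                    continue; // this widget was not found in any page's annotation list
                }
                List<String> names = fieldsByPage.computeIfAbsent(page, p -> new ArrayList<>());
                if (!names.contains(entry.getKey())) {
                    names.add(entry.getKey()); // list a field once per page, even with several widgets there
                }
            }
        }
        return fieldsByPage;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the output of the loop above.
        Map<String, List<Integer>> pagesByField = new LinkedHashMap<>();
        pagesByField.put("name", List.of(1));
        pagesByField.put("signature", List.of(1, 3));
        pagesByField.put("hidden", List.of());
        System.out.println(invert(pagesByField));
    }
}
```

With such a map, a per-page loop can look up only the fields that have a widget on the current page.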

Beware, there are two shortcomings in the PDFBox PDAcroForm form field handling:

  1. The PDF specification allows the global object structure defining the form to be a deep tree, i.e. the actual fields do not have to be direct children of the root but may be organized by means of inner tree nodes. PDFBox ignores this and expects the fields to be direct children of the root.

  2. Some PDFs in the wild, foremost older ones, do not contain the field tree but only reference the field objects from the pages via the visualizing widget annotations. PDFBox does not see these fields in its PDAcroForm.getFields list.
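
The first shortcoming can be worked around by walking the tree recursively instead of looking only at the root's direct children. The sketch below shows just the traversal idea in plain Java; `Node` is a hypothetical stand-in for a field object, not a PDFBox class, and it ignores that in a real PDF a node's kids may also be widget annotations rather than fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FieldTreeWalk {
    /** Hypothetical stand-in for a form field node; not a PDFBox class. */
    static class Node {
        final String partialName;
        final List<Node> kids = new ArrayList<>();

        Node(String partialName, Node... kids) {
            this.partialName = partialName;
            this.kids.addAll(Arrays.asList(kids));
        }
    }

    /** Collects the fully qualified names of all terminal fields below the given nodes. */
    static List<String> terminalNames(List<Node> nodes, String prefix) {
        List<String> names = new ArrayList<>();
        for (Node node : nodes) {
            // Fully qualified names join the partial names along the path with periods.
            String qualified = prefix.isEmpty() ? node.partialName : prefix + "." + node.partialName;
            if (node.kids.isEmpty()) {
                names.add(qualified);                               // terminal field
            } else {
                names.addAll(terminalNames(node.kids, qualified)); // inner tree node
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Node root = new Node("form",
                new Node("address", new Node("street"), new Node("city")),
                new Node("name"));
        System.out.println(terminalNames(List.of(root), ""));
    }
}
```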

PS: @mikhailvs in his answer correctly shows that you can retrieve a page object from a field widget using PDField.getWidget().getPage() and determine its page number using catalog.getAllPages().indexOf. While being fast this getPage() method has a drawback: It retrieves the page reference from an optional entry of the widget annotation dictionary. Thus, if the PDF you process has been created by software that fills that entry, all is well, but if the PDF creator has not filled that entry, all you get is a null page.

PDFBox 2.0.x

In 2.0.x some methods for accessing the elements in question have changed, but the situation as a whole has not: to safely retrieve the page of a widget, you still have to iterate through the pages and find a page that references the annotation.

The safe method:

int determineSafe(PDDocument document, PDAnnotationWidget widget) throws IOException
{
    COSDictionary widgetObject = widget.getCOSObject();
    PDPageTree pages = document.getPages();
    for (int i = 0; i < pages.getCount(); i++)
    {
        for (PDAnnotation annotation : pages.get(i).getAnnotations())
        {
            COSDictionary annotationObject = annotation.getCOSObject();
            if (annotationObject.equals(widgetObject))
                return i;
        }
    }
    return -1;
}

The fast method:

int determineFast(PDDocument document, PDAnnotationWidget widget)
{
    PDPage page = widget.getPage();
    return page != null ? document.getPages().indexOf(page) : -1;
}
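
Because the fast method returns -1 whenever the optional page entry is missing, the two methods can be combined: try the fast one first and fall back to the safe one only if it fails. A minimal sketch of that combination in plain Java follows; `PageLookup`, `PageFinder` and `fastThenSafe` are illustrative names, and the lambdas in `main` merely stand in for calls to `determineFast(document, widget)` and `determineSafe(document, widget)`.

```java
import java.io.IOException;

class PageLookup {
    /** A page search that may need to read the document; returns -1 if no page was found. */
    @FunctionalInterface
    interface PageFinder {
        int find() throws IOException;
    }

    /** Tries the cheap lookup first and runs the full search only if that one fails. */
    static int fastThenSafe(PageFinder fast, PageFinder safe) throws IOException {
        int pageIndex = fast.find();
        return pageIndex >= 0 ? pageIndex : safe.find();
    }

    public static void main(String[] args) throws IOException {
        // Fast lookup fails (-1), so the result of the safe search is used.
        System.out.println(fastThenSafe(() -> -1, () -> 2));
        // Fast lookup succeeds, so the safe search is never started.
        System.out.println(fastThenSafe(() -> 0, () -> { throw new IOException("not reached"); }));
    }
}
```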

Usage:

PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
if (acroForm != null)
{
    for (PDField field : acroForm.getFieldTree())
    {
        System.out.println(field.getFullyQualifiedName());
        for (PDAnnotationWidget widget : field.getWidgets())
        {
            System.out.print(widget.getAnnotationName() != null ? widget.getAnnotationName() : "(NN)");
            System.out.printf(" - fast: %s", determineFast(document, widget));
            System.out.printf(" - safe: %s\n", determineSafe(document, widget));
        }
    }
}

(DetermineWidgetPage.java)

(In contrast to the 1.8.x code, the safe method here simply searches for the page of a single widget. If your code has to determine the page of many widgets, you should create a lookup Map like in the 1.8.x case.)
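
Such a lookup could be sketched as follows. This is plain Java with generic stand-ins rather than PDFBox types: `P` plays the role of a page and `A` the role of the dictionary returned by `getCOSObject()`; `WidgetPageIndex` and `build` are illustrative names. It uses an identity-based map on the assumption that, as in the safe method, a widget and the matching entry in a page's annotation list resolve to the same object.

```java
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class WidgetPageIndex {
    /** Maps each annotation object to the 0-based index of the first page that lists it. */
    static <P, A> Map<A, Integer> build(List<P> pages, Function<P, List<A>> annotationsOf) {
        Map<A, Integer> pageIndexByAnnotation = new IdentityHashMap<>();
        for (int i = 0; i < pages.size(); i++) {
            for (A annotation : annotationsOf.apply(pages.get(i))) {
                // Keep the first page found, as a page-by-page search would.
                pageIndexByAnnotation.putIfAbsent(annotation, i);
            }
        }
        return pageIndexByAnnotation;
    }

    public static void main(String[] args) {
        // Hypothetical data: each "page" is simply the list of its annotation objects.
        Object widgetA = new Object();
        Object widgetB = new Object();
        List<List<Object>> pages =
                List.of(List.<Object>of(widgetA), List.<Object>of(), List.<Object>of(widgetB));

        Map<Object, Integer> index = build(pages, page -> page);
        System.out.println(index.get(widgetB));                   // page index of widgetB
        System.out.println(index.getOrDefault(new Object(), -1)); // widget not on any page
    }
}
```

The map is built once per document, so each later lookup no longer walks all pages. With PDFBox itself the loop would probably be written inline rather than passed as a `Function`, since the page's `getAnnotations()` is used in the safe method above within a method that declares `IOException`.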

Example documents

A document for which the fast method fails: aFieldTwice.pdf

A document for which the fast method works: test_duplicate_field2.pdf

mkl

Granted this answer may not help the OP (a year later), but if someone else runs into it, here is the solution:

PDDocumentCatalog catalog = doc.getDocumentCatalog();

int pageNumber = catalog.getAllPages().indexOf(yourField.getWidget().getPage());
HRÓÐÓLFR
  • If a field has multiple widgets on multiple pages, the page of which widgets do you get? – mkl Jul 16 '15 at 21:36
  • @mkl That's a good question. The docs say that it will get "the single associated widget that is part of this field." It is not entirely clear what happens in the case you are referring to. – HRÓÐÓLFR Jul 17 '15 at 18:09
  • "the single associated widget that is part of this field" sounds like covering the case of the widget object being merged into the field object. This merging is allowed for form fields with a single widget only. – mkl Jul 18 '15 at 05:31
  • yeah... I'm struggling with this issue in a project currently, and I've come across a pdf where the widget has no page associated with it (or something, .getPage() returns null) – HRÓÐÓLFR Jul 20 '15 at 19:28
  • Ok, I've looked at the sources. **A** `getWidget` returns the widget merged into the field dictionary or the first widget in the **Kids** array or `null` in case of an empty **Kids** array. **B** `getPage` returns the page referred to in the **P** entry. This entry in general is optional. Thus, `null` is a result which will happen every once in a while. – mkl Jul 20 '15 at 21:13

This example uses Lucee (cfml) https://lucee.org/

A big thank you to mkl as his answer above is invaluable and I couldn't have built this function without his help.

Call the function pageForSignature(doc, fieldName); it returns the number of the page that the field resides on (page numbers start at 0), or -1 if the field is not found.

  <cfscript>

  /*
  java is used by using CreateObject()
  */

  variables.File = CreateObject("java", "java.io.File");

  //references lucee bundle directory - typically on tomcat: /usr/local/tomcat/lucee-server/bundles
  variables.PDDocument = CreateObject("java", "org.apache.pdfbox.pdmodel.PDDocument", "org.apache.pdfbox.app", "2.0.18");

  function determineSafe(doc, widget){

    var i = '';
    var widgetObject = widget.getCOSObject();
    var pages = doc.getPages();
    var annotation = '';
    var annotationObject = '';

    for (i = 0; i < pages.getCount(); i=i+1){

    for (annotation in pages.get(i).getAnnotations()){
        if(annotation.getSubtype() eq 'widget'){
            annotationObject = annotation.getCOSObject();
            if (annotationObject.equals(widgetObject)){
                return i;
            }
        }
    }

    }
    return -1;
  }

  function pageForSignature(doc, fieldName){
    try{
      var acroForm = doc.getDocumentCatalog().getAcroForm();
      var field = '';
      var widget = '';
      var pageNo = '';

      for(field in acroForm.getFields()){
        if(field.getPartialName() == fieldName){
          for(widget in field.getWidgets()){
            // determineSafe() searches all pages itself, so widget.getPage()
            // (which may be null) is not needed here
            pageNo = determineSafe(doc, widget);
            doc.close();
            return pageNo;
          }
        }
      }
      return -1;
    }catch(e){
      doc.close();
      writeDump(label="catch error",var='#e#');
    }
  }

doc = PDDocument.init().load(File.init('/**********/myfile.pdf'));

// returns the page number; page numbers start at 0
pageNo = pageForSignature(doc, 'twtzceuxvx');

writeDump(label="pageForSignature(doc, fieldName)", var="#pageNo#");
</cfscript>
user2677034

A general solution for fields with a single widget or with multiple widgets (e.g. duplicates sharing the same qualified name on a single page):

List<PDAnnotationWidget> widgets = field.getWidgets();
PDDocumentCatalog catalog = doc.getDocumentCatalog();
for (int i = 0; i < widgets.size(); i++) {
    // getPage() reads an optional entry of the widget and may return null
    PDPage page = widgets.get(i).getPage();
    int pageNumber = page != null ? 1 + catalog.getPages().indexOf(page) : -1;

    // The field coordinates can also be read here; this works for single and multiple widgets alike.
    // PDRectangle r = widgets.get(i).getRectangle();
}
kamlesh
  • The entry whose value `getPage` returns is optional. You are very likely to get `null` at least as often as a `PDPage` instance if presented with PDFs from the wild. – mkl Jul 04 '17 at 12:45
  • Which version are you using? In PDFBox 2.x I did not find any getAllPages method in the PDDocumentCatalog class. – kamlesh Jul 05 '17 at 05:25
  • I did not get any null value with my PDF. It has 2 pages: page 1 has a radio button with 3 widgets (fields) sharing the same name, and page 2 has a checkbox with 2 fields sharing the same name. Can you please send your PDF, or describe the scenario in which you get null? To get only the page number you can use int pageNumber = 1+ catalog.getPages().indexOf(field.getWidgets().get(0).getPage()); – kamlesh Jul 05 '17 at 05:37
  • *I did not get any null value with my PDF* - as mentioned, the value is optional. If your PDF is created by a PDF producer that adds this value, your code works and is fairly fast. But other producers are likely not to add the value. You can use your code as the fast route, and if that route fails, use code akin to the code from my old answer. – mkl Jul 05 '17 at 06:56
  • *"Which version are you using? In PDFBox 2.x I did not find any getAllPages method in the PDDocumentCatalog class."* - if you refer to my answer above, please be aware that I wrote it in March 2014, so that likely was some 1.8.x version of PDFBox. In 2.0.x the preferred way to retrieve all pages is by `PDDocument.getPages()`, cf. [Migration to PDFBox 2.0.0](https://pdfbox.apache.org/2.0/migration.html). – mkl Jul 05 '17 at 07:04