4

I can fetch the field names from most PDF files using PDFBox, but not from an income tax form. Is something restricted in that form?

Although the form contains many fields, only one field is shown.

This is the output:

topmostSubform[0].

my code:

PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
List fields = acroForm.getFields();

@SuppressWarnings("rawtypes")
java.util.Iterator fieldsIter = fields.iterator();
System.out.println(fields.size());
while( fieldsIter.hasNext())
{
    PDField field = (PDField)fieldsIter.next();
    System.out.println(field.getFullyQualifiedName());
    System.out.println(field.getPartialName());
}

It is called from:

public static void main(String[] args) throws IOException {
    // main already declares IOException, so a failed load simply propagates
    // instead of leaving pdDoc null for the call below
    PDDocument pdDoc = PDDocument.load("income.pdf");
    try {
        Ggdfgdgdgf fields = new Ggdfgdgdgf();
        fields.printFields(pdDoc);
    } finally {
        pdDoc.close();
    }
}
Baswa Prasad

3 Answers

9

The PDF in question is a hybrid AcroForm/XFA form. This means that it contains the form definition both in AcroForm and in XFA format.

PDFBox primarily supports AcroForm (which is the PDF form technology presented in the PDF specification), but as both formats are present, PDFBox can at least inspect the AcroForm form definition.

Your code overlooks that PDAcroForm.getFields() returns only the root fields of the form, not all field definitions; cf. the JavaDoc comment:

/**
 * This will return all of the documents root fields.
 * 
 * A field might have children that are fields (non-terminal field) or does not
 * have children which are fields (terminal fields).
 * 
 * The fields within an AcroForm are organized in a tree structure. The documents root fields 
 * might either be terminal fields, non-terminal fields or a mixture of both. Non-terminal fields
 * mark branches which contents can be retrieved using {@link PDNonTerminalField#getChildren()}.
 * 
 * @return A list of the documents root fields.
 * 
 */
public List<PDField> getFields()

If you want to access all fields, you have to walk the form field tree, e.g. like this:

public void test() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("f2290.pdf");
            PDDocument pdfDocument = PDDocument.load(resource))
    {
        PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
        PDAcroForm acroForm = docCatalog.getAcroForm();
        List<PDField> fields = acroForm.getFields();
        for (PDField field : fields)
        {
            list(field);
        }
    }
}

void list(PDField field)
{
    System.out.println(field.getFullyQualifiedName());
    System.out.println(field.getPartialName());
    if (field instanceof PDNonTerminalField)
    {
        PDNonTerminalField nonTerminalField = (PDNonTerminalField) field;
        for (PDField child : nonTerminalField.getChildren())
        {
            list(child);
        }
    }
}

This returns a huge list of fields for your document.
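
To see in isolation why the original loop printed a single name, here is a minimal sketch of the same walk over a toy tree in plain Java. The Node class is a hypothetical stand-in, not a PDFBox type, and the sample names f1_002_0_[0], Page2[0] and f2_001_0_[0] are made up for illustration. In this simplified model a node without children counts as a terminal field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FieldTreeModel {

    /** Hypothetical stand-in for a form field; NOT a PDFBox class. */
    static final class Node {
        final String partialName;
        final List<Node> children;

        Node(String partialName, Node... children) {
            this.partialName = partialName;
            this.children = Arrays.asList(children);
        }

        /** Simplification: a node without children is treated as terminal. */
        boolean isTerminal() {
            return children.isEmpty();
        }
    }

    /** Collects the fully qualified names of all terminal nodes, depth first. */
    static List<String> terminalNames(List<Node> roots) {
        List<String> names = new ArrayList<>();
        for (Node root : roots) {
            collect(root, root.partialName, names);
        }
        return names;
    }

    private static void collect(Node node, String qualifiedName, List<String> out) {
        if (node.isTerminal()) {
            out.add(qualifiedName);
            return;
        }
        for (Node child : node.children) {
            // Qualified names are the partial names joined by dots
            collect(child, qualifiedName + "." + child.partialName, out);
        }
    }

    public static void main(String[] args) {
        // Made-up sample tree with a single root, as in the form from the question
        Node root = new Node("topmostSubform[0]",
                new Node("Page1[0]", new Node("f1_001_0_[0]"), new Node("f1_002_0_[0]")),
                new Node("Page2[0]", new Node("f2_001_0_[0]")));
        List<Node> roots = Arrays.asList(root);

        // Iterating only the roots sees one entry, like the loop in the question
        System.out.println(roots.size());

        // Walking the tree reaches the nested fields
        for (String name : terminalNames(roots)) {
            System.out.println(name);
        }
    }
}
```

The recursion has the same shape as list(PDField) above: print or collect at the current node, then descend into the children of non-terminal nodes.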

PS: You did not state which PDFBox version you use. Since the PDFBox developers now recommend the current 2.0.0 release candidates, I assumed that version in my answer.

mkl
  • unable to create PDNonTerminalField for the above code – Baswa Prasad Feb 29 '16 at 11:23
  • @BaswaPrasad *unable to create PDNonTerminalField* - who says you should create it? – mkl Feb 29 '16 at 11:25
  • I mean that I am not able to import it – Baswa Prasad Feb 29 '16 at 11:28
  • I am currently using PDFBox 0.7.3 – Baswa Prasad Feb 29 '16 at 11:28
  • Thanks, but even with your code above I get the same output: topmostSubform[0] – Baswa Prasad Feb 29 '16 at 11:44
  • 3
    0.7.3? Wow, ancient. I'm afraid you'll need 2.0.0 – mkl Feb 29 '16 at 11:49
  • I changed the jar to 2.0.0 and ran the code; it was only after that that I reported the same output – Baswa Prasad Feb 29 '16 at 11:52
  • That does not match my observations. – mkl Feb 29 '16 at 12:36
  • My PDF has hundreds of fields; how do I set values for a large number of fields? For one or two fields I use the following code, please suggest an approach for that many fields: `PDField fieldname1 = acroForm.getField("topmostSubform[0].Page1[0].f1_001_0_[0]"); if (fieldname1 != null) { fieldname1.setValue("xyz"); }` – Baswa Prasad Mar 01 '16 at 04:54
  • 1
    @Baswa *how do I set values for a large number of fields* - there is no special method for setting multiple fields, so you have to do it one field at a time. Obviously, though, you can keep the field names and the values to set in a `Map` and then iterate over the map entries to set the field values in a loop. – mkl Mar 01 '16 at 05:16
  • can you just give me an example to set the fields values in loop? because my values should come dynamically from database – Baswa Prasad Mar 01 '16 at 06:02
  • 1
    As I have no idea how in your case data base query results are to be mapped to PDF field names and values, I can hardly give a sample which is applicable in your situation and not trivial. – mkl Mar 01 '16 at 07:18
  • 2
    For completeness: to iterate through all fields using PDFBox 2.0.0 you can do `PDAcroForm form; ... for (PDField field : form.getFieldTree()) { ... (do something) }` – Maruan Sahyoun Mar 03 '16 at 21:57
  • How would one get all fields if there is no Acroform? This doesn't work for all kinds of pdfs. – Gamebuster19901 Oct 08 '22 at 03:31
  • If there is no **AcroForm** in the **Catalog** but there are form field widget annotations on some pages, the pdf strictly speaking is broken. In that case you can try and repair it by collecting the fields underneath those widget annotations, determining their root fields, and creating an **AcroForm** definition based on them. – mkl Oct 08 '22 at 05:43
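
The Map-based loop suggested in the comments above could look roughly like the following sketch. To keep it runnable without the library, the PDFBox types are replaced by hypothetical stand-ins: FormField models only the setter of PDField, and the lookup function takes the place of acroForm.getField(name). How database results map to field names is not covered here and is assumed to have produced the map already.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class BulkFieldFill {

    /** Hypothetical stand-in for PDField; only the setter is modelled. */
    interface FormField {
        void setValue(String value);
    }

    /**
     * Sets each map value on the field with the matching fully qualified name.
     * The lookup stands in for acroForm.getField(name) and may return null.
     *
     * @return the names for which no field was found
     */
    static List<String> fill(Map<String, String> values,
                             Function<String, FormField> lookup) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, String> entry : values.entrySet()) {
            FormField field = lookup.apply(entry.getKey());
            if (field == null) {
                // Same null check as in the single-field code from the comment
                missing.add(entry.getKey());
            } else {
                field.setValue(entry.getValue());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Fake "form": field name -> current value (assumption for the demo)
        Map<String, String> fakeForm = new LinkedHashMap<>();
        fakeForm.put("topmostSubform[0].Page1[0].f1_001_0_[0]", "");

        Function<String, FormField> lookup = name -> {
            if (!fakeForm.containsKey(name)) {
                return null;
            }
            return value -> fakeForm.put(name, value);
        };

        // In a real program these entries would come from the database
        Map<String, String> values = new LinkedHashMap<>();
        values.put("topmostSubform[0].Page1[0].f1_001_0_[0]", "xyz");
        values.put("no.such.field", "ignored");

        List<String> missing = fill(values, lookup);
        System.out.println("Filled form: " + fakeForm);
        System.out.println("Missing fields: " + missing);
    }
}
```

Returning the names that were not found, rather than silently skipping them, makes a mismatch between database columns and field names visible early.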
1

This can be done a lot more easily using fieldTree:

fun getFieldsInDocument(file: File): List<String> {
    return PDDocument.load(file).use { document ->
        document.documentCatalog.acroForm.fieldTree
                .filter { it is PDTerminalField }
                .map { field ->
                    field.fullyQualifiedName
                }
    }
}

This is Kotlin but in Java it looks basically the same.

Christoph Grimmer
-1

Here is sample code for reading a PDF. Before using it, set your input PDF file.

import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class JavaApplication14 {

    PDFParser parser;
    String parsedText;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;
//    PDDocumentInformation pdDocInfo;
// PDFTextParser Constructor 
    public JavaApplication14() {
    }
// Extract text from PDF Document
    String pdftoText(String fileName) {
        System.out.println("Parsing text from PDF file " + fileName + "....");
        File f = new File(fileName);
        if (!f.isFile()) {
            System.out.println("File " + fileName + " does not exist.");
            return null;
        }
        try {
            parser = new PDFParser(new FileInputStream(f));
        } catch (Exception e) {
            System.out.println("Unable to open PDF Parser.");
            return null;
        }
        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
            return null;
        } finally {
            // Release the document whether or not parsing succeeded
            try {
                if (pdDoc != null) {
                    pdDoc.close();
                } else if (cosDoc != null) {
                    cosDoc.close();
                }
            } catch (Exception e1) {
                e1.printStackTrace();
            }
        }
        System.out.println("Done.");
        return parsedText;
    }
// Write the parsed text from PDF to a file
    void writeTexttoFile(String pdfText, String fileName) {
        System.out.println("\nWriting PDF text to output text file " + fileName + "....");
        try {
            PrintWriter pw = new PrintWriter(fileName);
            pw.print(pdfText);
            pw.close();
        } catch (Exception e) {
            System.out.println("An exception occurred in writing the pdf text to file.");
            e.printStackTrace();
        }
        System.out.println("Done.");
    }

    public static void main(String args[]) {
        String fileList[] = {"E:\\JavaApplication14\\src\\javaapplication14\\issues.pdf", "E:\\JavaApplication14\\src\\javaapplication14\\newTextDocument.txt"};
        if (fileList.length != 2) {
            System.out.println("Usage: java PDFTextParser <InputPDFFilename> <OutputTextFile>");
            System.exit(1);
        }
        JavaApplication14 pdfTextParserObj = new JavaApplication14();
        String pdfToText = pdfTextParserObj.pdftoText(fileList[0]);
        if (pdfToText == null) {
            System.out.println("PDF to Text Conversion failed.");
        } else {
            System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
            pdfTextParserObj.writeTexttoFile(pdfToText, fileList[1]);
        }
    }
}
Ataur Rahman Munna