Read form fields in a PDF created by Adobe LiveCycle Designer

Question

How can I get the form fields from this PDF file? It is a dynamic PDF created by Adobe LiveCycle Designer. If you open the link in a web browser, you will probably see a single page that begins with 'Please wait...'. If you download the file and open it in Adobe Reader (5.0 or higher), you should see all 8 pages.

When I read the file with PyPDF2, I get an empty dictionary, apparently because it sees the file as the same single page that a web browser shows.

def print_fields(path):
    from PyPDF2 import PdfFileReader
    reader = PdfFileReader(str(path))
    fields = reader.getFields()
    print(fields)

The Java-dependent library tika can read the contents of all 8 pages. However, the results are messy, and I would like to avoid the Java dependency.

def read_via_tika(path):
    from tika import parser
    raw = parser.from_file(str(path))
    content = raw['content']
    print(content)

So, basically, I can manually use Edit -> Form Options -> Export Data… in Adobe Acrobat DC to get a nice XML file. I need to get the same form fields and their values via Python.

Alternatively, use this [link](https://www.uspto.gov/patent/forms/important-information-completing-application-data-sheet-ads) if the one in the question expired. — Max, Feb 21 '19 at 01:50

score 2 · Accepted Answer · answered Feb 21 '19 at 22:14


Thanks to this awesome answer, I managed to retrieve the fields using pdfminer.six.

Navigate through Catalog > AcroForm > XFA, then call pdfminer.pdftypes.resolve1 on the object that comes right after the b'datasets' element in the XFA list.
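A minimal sketch of that navigation, assuming the /XFA entry is an array of alternating names and stream references (it can also be a single stream, which this sketch does not handle). The function names pick_xfa_packet and read_xfa_datasets are made up for illustration; the pdfminer.six calls used are PDFParser, PDFDocument, its catalog attribute, resolve1, and the stream's get_data().

```python
def pick_xfa_packet(xfa_array, name=b"datasets"):
    """Return the entry that follows `name` in a flat
    [name, stream, name, stream, ...] list, or None if absent."""
    for key, value in zip(xfa_array[::2], xfa_array[1::2]):
        if key == name:
            return value
    return None


def read_xfa_datasets(path):
    # Imported lazily so the helper above works without pdfminer.six installed.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1

    with open(path, "rb") as fp:
        doc = PDFDocument(PDFParser(fp))
        acroform = resolve1(doc.catalog["AcroForm"])  # Catalog > AcroForm
        xfa = resolve1(acroform["XFA"])               # AcroForm > XFA (assumed to be a list)
        packet = pick_xfa_packet(xfa)
        if packet is None:
            return None
        # Resolve the indirect reference to a stream and decode its contents.
        return resolve1(packet).get_data()
```

Calling read_xfa_datasets(str(path)) should return the XML of the datasets packet as bytes, or None when the list has no b'datasets' entry.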


Max


How to navigate through Catalog > AcroForm > XFA in PDFMiner? @Max – Jinhua Wang May 09 '20 at 17:57

I am getting the following error: TypeError: int() argument must be a string, a bytes-like object or a number, not 'PSKeyword' – Jinhua Wang May 09 '20 at 17:58

score 1 · Answer 2 · answered May 09 '20 at 18:03

In my case, the following code worked (source: ankur garg)

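import PyPDF2 as pypdf

def findInDict(needle, haystack):
    # Depth-first search of nested dictionaries for the first value stored under `needle`.
    for key in haystack.keys():
        try:
            value = haystack[key]
        except Exception:
            # Skip entries that cannot be resolved.
            continue
        if key == needle:
            return value
        if isinstance(value, dict):
            x = findInDict(needle, value)
            if x is not None:
                return x
    return None

# Read the stream while the file is still open.
with open('CTRX_filled.pdf', 'rb') as pdfobject:
    pdf = pypdf.PdfFileReader(pdfobject)
    xfa = findInDict('/XFA', pdf.resolvedObjects)
    # /XFA is a flat list of name/stream pairs; index 7 worked for this file
    # (presumably the stream after 'datasets'), other files may differ.
    xml = xfa[7].getObject().getData()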
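The bytes in xml are an XML document, so the field names and values still have to be pulled out of it. A rough sketch using only the standard library follows. It assumes the field values sit in leaf elements, and flatten_xfa_data is a hypothetical helper name. Repeated elements with the same path overwrite each other in this simple version.

```python
import xml.etree.ElementTree as ET


def flatten_xfa_data(xml_bytes):
    """Map each leaf element's slash-separated path to its stripped text."""

    def local(tag):
        # Drop a leading "{namespace}" from the tag name, if present.
        return tag.rsplit("}", 1)[-1]

    fields = {}

    def walk(node, prefix):
        path = f"{prefix}/{local(node.tag)}" if prefix else local(node.tag)
        children = list(node)
        if not children:
            fields[path] = (node.text or "").strip()
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_bytes), "")
    return fields


# Made-up sample data, not taken from the PDF in the question.
sample = b"<form1><Name>Ada</Name><Address><City>London</City></Address></form1>"
print(flatten_xfa_data(sample))
```

With the real file, flatten_xfa_data(xml) would be the call to try; whether the resulting paths match the XML exported from Acrobat depends on the form.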
