How can I get the fields from this PDF file? It is a dynamic PDF created with Adobe LiveCycle Designer. If you open the link in a web browser, you will probably see a single page that starts with 'Please wait...'. If you download the file and open it in Adobe Reader (5.0 or higher), you should see all 8 pages.

So when you read it with PyPDF2, you get an empty dictionary, because PyPDF2 sees the file as the same single page that a web browser shows.

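def print_fields(path):
    # PdfFileReader/getFields are the legacy PyPDF2 names (PdfReader/get_fields in PyPDF2 3.x)
    from PyPDF2 import PdfFileReader
    reader = PdfFileReader(str(path))
    fields = reader.getFields()
    print(fields)  # an empty dictionary for this file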

You can use the Java-dependent library tika to read the contents of all 8 pages. However, the results are messy, and I would like to avoid a Java dependency.

def read_via_tika(path):
    from tika import parser
    raw = parser.from_file(str(path))
    content = raw['content']
    print(content)

So, basically, I can manually use Edit -> Form Options -> Export Data… in Adobe Acrobat DC to get a nice XML file. I need to get the same form fields and their values via Python.

Max
  • Alternatively, use this [link](https://www.uspto.gov/patent/forms/important-information-completing-application-data-sheet-ads) if the one in the question expired. – Max Feb 21 '19 at 01:50

2 Answers

2

Thanks to this awesome answer, I managed to retrieve the fields using pdfminer.six.

Navigate through Catalog > AcroForm > XFA, then call pdfminer.pdftypes.resolve1 on the object that comes right after the b'datasets' element in the list.
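The navigation above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the XFA entry is an array of alternating packet names and stream references, and the helper names `read_xfa_datasets` and `leaf_values`, as well as the sample XML, are made up for illustration.

```python
import xml.etree.ElementTree as ET


def read_xfa_datasets(path):
    """Return the raw XML bytes of the XFA 'datasets' packet (hypothetical helper)."""
    # Imported here so the rest of the sketch works without pdfminer.six installed.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdftypes import resolve1

    with open(path, 'rb') as fp:
        doc = PDFDocument(PDFParser(fp))
        # Catalog > AcroForm > XFA
        acroform = resolve1(doc.catalog['AcroForm'])
        xfa = resolve1(acroform['XFA'])
        # Assumption: xfa is a list like [b'template', ref, b'datasets', ref, ...]
        for name, ref in zip(xfa[0::2], xfa[1::2]):
            if name == b'datasets':
                return resolve1(ref).get_data()
    raise ValueError("no 'datasets' packet found in the XFA array")


def leaf_values(xml_bytes):
    """List (tag, text) pairs for every element that has no child elements."""
    root = ET.fromstring(xml_bytes)
    pairs = []
    for el in root.iter():
        if len(el) == 0:
            tag = el.tag.rsplit('}', 1)[-1]  # drop the '{namespace}' prefix, if any
            pairs.append((tag, (el.text or '').strip()))
    return pairs


# Hypothetical sample shaped like a datasets packet, so the parsing step
# can be tried without a PDF file.
sample = (b'<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">'
          b'<xfa:data><form><name>Ada</name><city>Paris</city></form></xfa:data>'
          b'</xfa:datasets>')
print(leaf_values(sample))
```

With a real file, `leaf_values(read_xfa_datasets(path))` would give a flat list of field names and values; repeated field names are kept as separate pairs rather than overwritten.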

Max
1

In my case, the following code worked (source: ankur garg)

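import PyPDF2 as pypdf

def findInDict(needle, haystack):
    """Recursively search nested dictionaries for the key `needle`."""
    for key in haystack.keys():
        try:
            value = haystack[key]
        except Exception:
            # skip entries that cannot be read
            continue
        if key == needle:
            return value
        if isinstance(value, dict):
            x = findInDict(needle, value)
            if x is not None:
                return x

with open('CTRX_filled.pdf', 'rb') as pdfobject:
    pdf = pypdf.PdfFileReader(pdfobject)
    xfa = findInDict('/XFA', pdf.resolvedObjects)
    # index 7 worked for this file; the position may differ in other PDFs
    xml = xfa[7].getObject().getData()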
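Since the index 7 is specific to one file, a name-based lookup may be more robust. The following is only a sketch under the assumption that the XFA list alternates packet names with the objects that hold them; `packets_by_name` and the example list are hypothetical and not part of PyPDF2.

```python
def packets_by_name(xfa):
    """Map each packet name in an XFA list to the object that follows it."""
    packets = {}
    for name, obj in zip(xfa[0::2], xfa[1::2]):
        if isinstance(name, bytes):
            # some libraries return names as bytes rather than str
            name = name.decode('latin-1')
        packets[str(name)] = obj
    return packets


# Made-up stand-in for the list stored under '/XFA'; real entries would be
# indirect object references rather than strings.
example = ['preamble', 'ref0', 'config', 'ref1',
           'template', 'ref2', 'datasets', 'ref3']
print(packets_by_name(example)['datasets'])
```

With the code above, `packets_by_name(xfa)['datasets'].getObject().getData()` would then replace the hard-coded `xfa[7]`.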
Jinhua Wang