How can I get the fields from this PDF file? It is a dynamic PDF created by Adobe LiveCycle Designer. If you open the link in a web browser, you will probably see a single page that begins with 'Please wait...'. If you download the file and open it in Adobe Reader (5.0 or higher), you should see all 8 pages.
So, when reading it via PyPDF2, you get an empty dictionary, because PyPDF2 sees the file as a single page, like the one you see in a web browser.
def print_fields(path):
    from PyPDF2 import PdfFileReader
    reader = PdfFileReader(str(path))
    fields = reader.getFields()
    print(fields)
You can use the Java-dependent library tika to read the contents of all 8 pages. However, the results are messy, and I would like to avoid a Java dependency.
def read_via_tika(path):
    from tika import parser
    raw = parser.from_file(str(path))
    content = raw['content']
    print(content)
So, basically, I can manually use Edit -> Form Options -> Export Data… in Adobe Acrobat DC to get a nice XML file. Similarly, I need to get the form fields and their values via Python.
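One direction that might avoid Java, sketched under assumptions: XFA-based forms keep their data as XML packets in the /XFA entry of the document's /AcroForm dictionary, and the packet named datasets holds the field values. The sketch below assumes the legacy PyPDF2 1.x API (PdfFileReader, getObject, getData). The helper names find_xfa_packets and flatten_datasets are hypothetical, and the code is untested against the file linked above.

```python
import xml.etree.ElementTree as ET


def find_xfa_packets(path):
    """Return {packet_name: raw_bytes} from /Root/AcroForm/XFA, or {} if absent.

    Assumption: legacy PyPDF2 1.x, imported lazily as in the snippets above.
    """
    from PyPDF2 import PdfFileReader

    reader = PdfFileReader(str(path))
    root = reader.trailer["/Root"]
    if "/AcroForm" not in root:
        return {}
    acro_form = root["/AcroForm"]
    if "/XFA" not in acro_form:
        return {}

    xfa = acro_form["/XFA"]
    if not isinstance(xfa, list):
        # /XFA may be a single stream holding the whole XML document.
        return {"xdp": xfa.getData()}

    # Otherwise /XFA is an array of alternating names and streams:
    # [name0, stream0, name1, stream1, ...]
    packets = {}
    for i in range(0, len(xfa) - 1, 2):
        name = str(xfa[i])
        packets[name] = xfa[i + 1].getObject().getData()
    return packets


def flatten_datasets(xml_bytes):
    """Map each leaf element's path to its text, ignoring XML namespaces.

    Repeated elements that share a path overwrite each other, so this is
    only a starting point for forms with repeating rows.
    """
    fields = {}

    def local_name(tag):
        # "{namespace}name" -> "name"
        return tag.rsplit("}", 1)[-1]

    def walk(node, prefix):
        name = local_name(node.tag)
        path = prefix + "/" + name if prefix else name
        children = list(node)
        if not children:
            fields[path] = (node.text or "").strip()
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_bytes), "")
    return fields


def print_xfa_fields(path):
    packets = find_xfa_packets(path)
    if "datasets" in packets:
        for name, value in flatten_datasets(packets["datasets"]).items():
            print(name, "=", value)
    else:
        print("No XFA datasets packet found; packets present:", list(packets))
```

If this works, the datasets packet should be close to the XML that Export Data… produces, since both carry the form's field values rather than its layout.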