Getting None from fields while parsing a PDF file

Question

I am trying to parse a PDF file and collect the values of all its checkboxes in a list or dictionary, but I get this error:

"return OrderedDict((k, v.get('/V', '')) for k, v in fields.items()) AttributeError: 'NoneType' object has no attribute 'items'"

This is the code I am using:

from collections import OrderedDict
from PyPDF2 import PdfFileWriter, PdfFileReader

def _getFields(obj, tree=None, retval=None, fileobj=None):
    
    fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent', '/T': 'Field Name', '/TU': 'Alternate Field Name',
                       '/TM': 'Mapping Name', '/Ff': 'Field Flags', '/V': 'Value', '/DV': 'Default Value'}
    if retval is None:
        retval = OrderedDict()
        catalog = obj.trailer["/Root"]
        # get the AcroForm tree
        if "/AcroForm" in catalog:
            tree = catalog["/AcroForm"]
        else:
            return None
    if tree is None:
        return retval

    obj._checkKids(tree, retval, fileobj)
    for attr in fieldAttributes:
        if attr in tree:
            # Tree is a field
            obj._buildField(tree, retval, fileobj, fieldAttributes)
            break

    if "/Fields" in tree:
        fields = tree["/Fields"]
        for f in fields:
            field = f.getObject()
            obj._buildField(field, retval, fileobj, fieldAttributes)

    return retval

def get_form_fields(infile):
    infile = PdfFileReader(open(infile, 'rb'))
    fields = _getFields(infile)
    return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())

if __name__ == '__main__':
    from pprint import pprint

    pdf_file_name = 'Guild.pdf'

    pprint(get_form_fields(pdf_file_name))

`retval` is None, and the document does not contain an `AcroForm`, so your `_getFields` function returns `None`. There are no fields. Perhaps you need to add `if not fields: return None`. — Tim Roberts, Apr 18 '22 at 20:07
Why did you write `return None` in `_getFields`? Or why did you think that `None` has an attribute `items`? I'm a bit confused what exactly your question is. — mkrieger1, Apr 18 '22 at 20:07
I want to get all the values of the checkboxes that are in the pdf file. @mkrieger1 — saxope, Apr 18 '22 at 20:09
so what should I change that will get all the values of the checkboxes correctly? can you help? @TimRoberts — saxope, Apr 18 '22 at 20:10
That's just not how PDF files work. Those checkboxes do not HAVE values. They're just individual characters within the file. You're going to have to use the PyPDF2 APIs to extract the text from that document, and see if it gives you the box characters. — Tim Roberts, Apr 18 '22 at 20:17
how to do it then? that is my question. can you help? @TimRoberts — saxope, Apr 18 '22 at 20:22
It's not going to be easy. I can use `pdfminer` to extract all of the text, but it doesn't get the checkboxes. You may have to look at the individual objects in the PDF file, and I'm not sure PyPDF2 goes to that level of detail. — Tim Roberts, Apr 18 '22 at 20:29
that is the problem. I need the values of the checkboxes. @TimRoberts — saxope, Apr 18 '22 at 20:31
No, you can't say that. The checkboxes don't have values. Looking at the internals, the checkboxes are just images, so you can't even tell which image is "checked" and which is "unchecked" without a human eye. You have a lot of work ahead of you. If the forms are all the same, it MIGHT be easier to convert the PDFs to PNGs, and use an image library to look at the known locations of the checkboxes. — Tim Roberts, Apr 18 '22 at 20:38
The code you show is copy-pasted from another [question](https://stackoverflow.com/questions/3984003/how-to-extract-pdf-fields-from-a-filled-out-form-in-python). I checked the PDF link you provided; did you check the boxes yourself, or did you get the PDF already checked? — cards, Apr 18 '22 at 20:39
the pdf is already checked. I just need to get the values of the checkboxes as a list or dictionary. @cards — saxope, Apr 18 '22 at 20:41
The code you are using works when the fields are "active" form fields. In this case they seem to be just symbols of a box with a cross. — cards, Apr 18 '22 at 20:43
try to make a new pdf containing some fields and run that code... — cards, Apr 18 '22 at 20:44
so what should I do to extract the checkboxes values of the pdf I provided? do you have any suggestions? @cards — saxope, Apr 18 '22 at 20:44
OCR is one way to do it. I suspect it will be the easiest way. — Tim Roberts, Apr 18 '22 at 20:44
The problem is that the PDF files cannot change. They are system generated and look exactly like this. @cards — saxope, Apr 18 '22 at 20:45
the problem is that code is useless in this case. forget about pdfminer, pypdf, ... use an ocr approach — cards, Apr 18 '22 at 20:46
thanks for the answers. will try the ocr way then. @TimRoberts — saxope, Apr 18 '22 at 20:51
@KJ -- There are two 18x19 images in there. I assume one is filled, one is unfilled. The 150x158 image is probably the logo. — Tim Roberts, Apr 18 '22 at 22:16
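
A rough sketch of the fixed-location idea from the comments above, assuming the page has already been rendered to a grayscale raster (0 = black, 255 = white) by some other tool. The helper names `dark_fraction` and `is_checked`, the threshold values, and the box coordinates are all hypothetical and would need calibrating against the real forms.

```python
def dark_fraction(pixels, box, threshold=128):
    """Return the fraction of pixels darker than `threshold` inside `box`.

    `pixels` is a list of rows of grayscale values (0-255).
    `box` is (left, top, right, bottom), with right/bottom exclusive.
    """
    left, top, right, bottom = box
    total = (right - left) * (bottom - top)
    if total <= 0:
        raise ValueError("box must have positive area")
    dark = sum(
        1
        for y in range(top, bottom)
        for x in range(left, right)
        if pixels[y][x] < threshold
    )
    return dark / total


def is_checked(pixels, box, min_dark=0.2):
    """Guess that a checkbox is ticked if enough of its interior is dark.

    `min_dark` is an assumed cut-off, not a measured one.
    """
    return dark_fraction(pixels, box) >= min_dark


# Toy 4x4 "checkbox interiors": one blank, one with a diagonal cross.
empty_box = [[255] * 4 for _ in range(4)]
crossed_box = [
    [0 if x == y or x + y == 3 else 255 for x in range(4)]
    for y in range(4)
]

print(is_checked(empty_box, (0, 0, 4, 4)))
print(is_checked(crossed_box, (0, 0, 4, 4)))
```

In practice the box should be inset a few pixels so that the checkbox border itself is not counted as dark.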

score 0 · Answer 1 · answered Apr 18 '22 at 20:10


Tracing through your code, on the 10th line `catalog` holds the value `{'/Metadata': IndirectObject(16, 0), '/Pages': IndirectObject(1, 0), '/Type': '/Catalog'}`. `/AcroForm` is not a key in that dictionary, so `_getFields` returns `None`.
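
A minimal sketch of guarding against that `None`, assuming the caller would rather get an empty mapping than an exception. `has_acroform` and `field_values` are hypothetical helper names, not PyPDF2 API, and the catalog literal uses placeholder strings instead of real indirect objects.

```python
from collections import OrderedDict


def has_acroform(catalog):
    """Return True if the document catalog declares an interactive form."""
    return "/AcroForm" in catalog


def field_values(fields):
    """Map field name -> '/V' value, treating None (no AcroForm) as no fields."""
    if not fields:
        return OrderedDict()
    return OrderedDict(
        (name, attrs.get("/V", "")) for name, attrs in fields.items()
    )


# The catalog reported for this PDF, with the indirect objects replaced
# by placeholder strings for illustration only.
catalog = {
    "/Metadata": "IndirectObject(16, 0)",
    "/Pages": "IndirectObject(1, 0)",
    "/Type": "/Catalog",
}

print(has_acroform(catalog))  # False
print(field_values(None))     # OrderedDict()
```

This only removes the `AttributeError`. For a PDF without an AcroForm the result is still empty, because there are no form fields to read.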

Gareth Ma

So what should I write instead of AcroForm? Any suggestions? – saxope Apr 18 '22 at 20:12
@saxope PyPDF2 seems to be [dead](https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file). – Gareth Ma Apr 18 '22 at 20:12
so what should I use to parse the values of the checkboxes of the pdf file? do you have any suggestion? – saxope Apr 18 '22 at 20:15
I visited it actually. The answers seem to be about text extraction. I need to extract the values of the checkboxes. Can you provide a link that contains that information? I did not find it there. – saxope Apr 18 '22 at 20:20

score -1 · Answer 2 · answered Apr 18 '22 at 20:09


Your `_getFields` explicitly returns `None` from the first `if` block when the catalog has no `/AcroForm`. `fields` is then `None`, so calling `fields.items()` raises this error.

evtn
