Disclaimer: I'm fairly new to Python and programming in general. This question has a few different components - answers to any or all of them would be extremely helpful.
I'm trying to write a program in Python to extract location names from foreign aid documents.
These documents are typically PDF files, so I initially converted them from PDF to TXT with Adobe Reader. Since I want to integrate the conversion into my program, I installed PDFMiner and have been testing code from a previous Stack Overflow question (How do I use pdfminer as a library). This is the code I'm currently using:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()  # renamed from str to avoid shadowing the builtin
    retstr.close()
    return text
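Since the two conversions look identical when printed, one sanity check I've considered is diffing the PDFMiner output against the Adobe output to find invisible differences. Here's a sketch (the file contents below are just dummy stand-ins for the two conversions):

```python
import difflib

def diff_texts(adobe_text, pdfminer_text):
    """Return a unified diff between two converted texts (empty if identical)."""
    diff = difflib.unified_diff(
        adobe_text.splitlines(keepends=True),
        pdfminer_text.splitlines(keepends=True),
        fromfile='adobe.txt',
        tofile='pdfminer.txt',
    )
    return ''.join(diff)

# Dummy strings standing in for the two conversions; the second contains a
# stray character like the ones the tagger complains about:
print(diff_texts(u"Aid to Ghana.\n", u"Aid to \uf0b7 Ghana.\n"))
```

If the diff is non-empty, that would point at exactly which characters PDFMiner emits differently.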
When I print the output in the shell, it looks the same as the text created by Adobe Reader, but the Stanford NER tagger isn't finding any entities; it just returns an empty dictionary for each sentence. I'm using pyner (https://github.com/dat/pyner) as the interface to Stanford NER. It's not a problem with the socket: the same setup worked before on the Adobe-converted files. This is my code for calling Stanford NER:
import ner
def findloc(text):
    tagger = ner.SocketNER(host='localhost', port=8080)
    loclist = []
    sentence = ""
    for char in text:
        if char == ".":
            sentence += "."
            tagsent = tagger.get_entities(sentence)
            if u'LOCATION' in tagsent:
                loclist.extend(tagsent[u'LOCATION'])
            sentence = ""
        else:
            sentence += char
    return [x.encode('ascii').lower() for x in loclist]
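As an aside on the loop above: I've been building sentences character by character, but the same chunking could be done in a small helper, which is easier to test on its own (the tagger call is left out here since it needs the NER server running):

```python
def split_sentences(text):
    """Split text into period-terminated chunks, mirroring the loop in findloc."""
    sentences = []
    current = ""
    for char in text:
        current += char
        if char == ".":
            sentences.append(current)
            current = ""
    return sentences

# Each chunk would then be passed to tagger.get_entities(sentence) as before.
print(split_sentences("Aid went to Ghana. Roads were built."))
```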
When the code runs, this warning appears fairly frequently in the terminal (it also appeared occasionally with the Adobe-converted files):
edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+F0B7, decimal: 61623)
Why is this happening and how can I fix it?
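In case it helps: U+F0B7 seems to fall in Unicode's Private Use Area (I gather symbol-font bullets in PDFs often come through as these), so one idea I've considered is stripping such characters before sending text to the tagger. A sketch:

```python
import unicodedata

def strip_untokenizable(text):
    """Drop characters in Unicode category 'Co' (private use), e.g. U+F0B7."""
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Co')

print(strip_untokenizable(u"\uf0b7 Improve rural roads."))
```

I don't know whether this is the right fix, or whether those characters are a symptom of a deeper extraction problem.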
Here's an example document I've been working with for reference: http://www-wds.worldbank.org/external/default/WDSContentServer/WDSP/IB/2009/02/03/000350881_20090203110828/Rendered/PDF/432750PJPR0BR010P1028180Box0334125B.pdf
Side note: As you can see, not everything in these documents is in sentence format, so ideally my text-mining program would eventually recognize tables and the like as well. But I'm new to this and don't know how to implement that, so I want to get a handle on this basic named entity recognizer first. That said, I'm very open to any suggestions.
Thanks so much in advance!