Disclaimer: I'm fairly new to Python and programming in general. This question has a few different components - answers to any or all of them would be extremely helpful.
I'm trying to write a program in Python to extract location names from foreign aid documents.
These documents are typically PDF files, so I initially converted them from PDF to TXT with Adobe Reader. Since I want to integrate the conversion into my program, I installed PDFMiner and have been testing code from a previous Stack Overflow question (How do I use pdfminer as a library). This is the code I'm currently using:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()  # renamed from str to avoid shadowing the builtin
    retstr.close()
    return text
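Since the two conversions look identical when printed, one sanity check I've considered is diffing the PDFMiner output against the Adobe output to find invisible differences. Here's a sketch (the file contents below are just dummy stand-ins for the two conversions):

```python
import difflib

def diff_texts(adobe_text, pdfminer_text):
    """Return a unified diff between two converted texts (empty if identical)."""
    diff = difflib.unified_diff(
        adobe_text.splitlines(keepends=True),
        pdfminer_text.splitlines(keepends=True),
        fromfile='adobe.txt',
        tofile='pdfminer.txt',
    )
    return ''.join(diff)

# Dummy strings standing in for the two conversions; the second contains a
# stray character like the ones the tagger complains about:
print(diff_texts(u"Aid to Ghana.\n", u"Aid to \uf0b7 Ghana.\n"))
```

If the diff is non-empty, that would point at exactly which characters PDFMiner emits differently.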
When I print the output in the shell, it looks the same as the text created by Adobe Reader, but the Stanford NER tagger isn't finding any entities; it just returns an empty dictionary for each sentence. I'm using pyner (https://github.com/dat/pyner) as the interface to Stanford NER. It's not a problem with the socket: the same setup worked before on the Adobe-converted files. This is my code for calling Stanford NER:
import ner
def findloc(text):
    tagger = ner.SocketNER(host='localhost', port=8080)
    loclist = []
    sentence = ""
    for char in text:
        if char == ".":
            sentence += "."
            tagsent = tagger.get_entities(sentence)
            if u'LOCATION' in tagsent:
                loclist.extend(tagsent[u'LOCATION'])
            sentence = ""
        else:
            sentence += char
    return [x.encode('ascii').lower() for x in loclist]
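As an aside on the loop above: I've been building sentences character by character, but the same chunking could be done in a small helper, which is easier to test on its own (the tagger call is left out here since it needs the NER server running):

```python
def split_sentences(text):
    """Split text into period-terminated chunks, mirroring the loop in findloc."""
    sentences = []
    current = ""
    for char in text:
        current += char
        if char == ".":
            sentences.append(current)
            current = ""
    return sentences

# Each chunk would then be passed to tagger.get_entities(sentence) as before.
print(split_sentences("Aid went to Ghana. Roads were built."))
```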
When the code runs, this warning appears fairly frequently in the terminal (it also appeared occasionally with the Adobe-converted files):
edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+F0B7, decimal: 61623)
Why is this happening and how can I fix it?
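In case it helps: U+F0B7 seems to fall in Unicode's Private Use Area (I gather symbol-font bullets in PDFs often come through as these), so one idea I've considered is stripping such characters before sending text to the tagger. A sketch:

```python
import unicodedata

def strip_untokenizable(text):
    """Drop characters in Unicode category 'Co' (private use), e.g. U+F0B7."""
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Co')

print(strip_untokenizable(u"\uf0b7 Improve rural roads."))
```

I don't know whether this is the right fix, or whether those characters are a symptom of a deeper extraction problem.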
Here's an example document I've been working with for reference: http://www-wds.worldbank.org/external/default/WDSContentServer/WDSP/IB/2009/02/03/000350881_20090203110828/Rendered/PDF/432750PJPR0BR010P1028180Box0334125B.pdf
Side note: As you can see, not everything in these documents is in sentence format, so ideally my text-mining program would eventually recognize tables and the like as well. But I'm new to this and don't know how to implement that, so I want to get a handle on this basic named entity recognizer first. That said, I'm very open to any suggestions.
Thanks so much in advance!