I have millions of texts, ranging from 100 to 4,000 words each. They are formatted as written work, with proper punctuation and grammar, and everything is in English.
The problem is simple: how do I extract every WikiData entity from a given text?
An entity is defined as any noun, proper or common: names of people, organizations, and locations, but also ordinary things like chairs and potatoes.
So far I've tried the following:
- Tokenize the text with OpenNLP and use the pre-trained models to extract people, locations, organizations, and regular nouns (a sketch of this step follows the list).
- Apply Porter stemming where applicable.
- Match each extracted noun against the wmflabs API to retrieve a potential WikiData ID (see the lookup sketch below).
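For reference, here is roughly what the extraction step looks like. This is a minimal sketch: it assumes the standard pre-trained OpenNLP 1.5 model files sit in the working directory, only the person finder is shown (locations and organizations plug in the same way), and OpenNLP's built-in `PorterStemmer` needs version 1.6+:

```scala
import java.io.FileInputStream

import opennlp.tools.namefind.{NameFinderME, TokenNameFinderModel}
import opennlp.tools.postag.{POSModel, POSTaggerME}
import opennlp.tools.stemmer.PorterStemmer
import opennlp.tools.tokenize.{TokenizerME, TokenizerModel}
import opennlp.tools.util.Span

// Standard pre-trained models; the file names/paths are assumptions.
val tokenizer    = new TokenizerME(new TokenizerModel(new FileInputStream("en-token.bin")))
val posTagger    = new POSTaggerME(new POSModel(new FileInputStream("en-pos-maxent.bin")))
val personFinder = new NameFinderME(new TokenNameFinderModel(new FileInputStream("en-ner-person.bin")))

def extractCandidates(text: String): Seq[String] = {
  val tokens = tokenizer.tokenize(text)

  // Named entities; en-ner-location.bin and en-ner-organization.bin work the same way.
  val names = Span.spansToStrings(personFinder.find(tokens), tokens).toSeq
  personFinder.clearAdaptiveData() // reset the finder between documents

  // Regular nouns via POS tags (NN/NNS), stemmed where applicable.
  val stemmer = new PorterStemmer
  val nouns = tokens.zip(posTagger.tag(tokens)).collect {
    case (token, tag) if tag == "NN" || tag == "NNS" => stemmer.stem(token)
  }

  (names ++ nouns).distinct
}
```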
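And the lookup step, sketched here against the public `wbsearchentities` endpoint on wikidata.org rather than the wmflabs wrapper; the regex "parsing" is a placeholder, and a real version should use a JSON library such as play-json or circe:

```scala
import java.net.URLEncoder

import scala.io.Source
import scala.util.Try

// Returns the first matching entity ID (e.g. "Q42"), or None.
def lookupWikidataId(term: String): Option[String] = {
  val url = "https://www.wikidata.org/w/api.php?action=wbsearchentities" +
    "&format=json&language=en&search=" + URLEncoder.encode(term, "UTF-8")
  Try(Source.fromURL(url).mkString).toOption.flatMap { body =>
    // Crude extraction of the first "id" field; stands in for real JSON parsing.
    """"id":"(Q\d+)"""".r.findFirstMatchIn(body).map(_.group(1))
  }
}
```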
This works, but I feel like I can do better. One obvious improvement would be to cache the relevant pieces of WikiData locally, which I plan on doing. However, before I do that, I want to check if there are other solutions.
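In the meantime, the API calls can at least be memoized in-process so each distinct term hits the endpoint only once. A trivial sketch on top of the `lookupWikidataId` helper above:

```scala
import scala.collection.concurrent.TrieMap

// Thread-safe memo table: term -> WikiData ID (if any).
val idCache = TrieMap.empty[String, Option[String]]

def cachedLookup(term: String): Option[String] =
  idCache.getOrElseUpdate(term, lookupWikidataId(term))
```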
Suggestions?
I tagged the question Scala because I'm using Spark for the task.
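The Spark side is wired up roughly as below. One detail worth noting: the OpenNLP models are not serializable, so they have to be built on the executors, once per partition, rather than on the driver. The HDFS path is a placeholder, and the model files must be available on each executor (e.g. shipped with `--files`):

```scala
import java.io.FileInputStream

import opennlp.tools.tokenize.{TokenizerME, TokenizerModel}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("wikidata-entities").getOrCreate()
val docs  = spark.sparkContext.textFile("hdfs:///path/to/texts") // placeholder path

val tokenized = docs.mapPartitions { partition =>
  // Build the model once per partition instead of closing over a driver-side
  // instance; the POS tagger and name finders follow the same pattern.
  val tokenizer = new TokenizerME(new TokenizerModel(new FileInputStream("en-token.bin")))
  partition.map(doc => tokenizer.tokenize(doc).toSeq)
}
```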