I have millions of texts, ranging from 100 to 4,000 words each. They are formatted as written work, with proper punctuation and grammar, and everything is in English.
The problem is simple: how do I extract every WikiData entity from a given text?
An entity is defined as any noun, proper or common: names of people, organizations, and locations, but also ordinary things like chairs and potatoes.
So far I've tried the following:
- Tokenize the text with OpenNLP and use the pre-trained models to extract people, locations, organizations, and regular nouns (a sketch of this step follows the list).
- Apply Porter stemming where applicable.
- Match each extracted noun against the wmflabs API to retrieve a potential WikiData ID (see the lookup sketch below).
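For reference, here is roughly what the extraction step looks like. This is a minimal sketch: it assumes the standard pre-trained OpenNLP 1.5 model files sit in the working directory, only the person finder is shown (locations and organizations plug in the same way), and OpenNLP's built-in `PorterStemmer` needs version 1.6+:

```scala
import java.io.FileInputStream

import opennlp.tools.namefind.{NameFinderME, TokenNameFinderModel}
import opennlp.tools.postag.{POSModel, POSTaggerME}
import opennlp.tools.stemmer.PorterStemmer
import opennlp.tools.tokenize.{TokenizerME, TokenizerModel}
import opennlp.tools.util.Span

// Standard pre-trained models; the file names/paths are assumptions.
val tokenizer    = new TokenizerME(new TokenizerModel(new FileInputStream("en-token.bin")))
val posTagger    = new POSTaggerME(new POSModel(new FileInputStream("en-pos-maxent.bin")))
val personFinder = new NameFinderME(new TokenNameFinderModel(new FileInputStream("en-ner-person.bin")))

def extractCandidates(text: String): Seq[String] = {
  val tokens = tokenizer.tokenize(text)

  // Named entities; en-ner-location.bin and en-ner-organization.bin work the same way.
  val names = Span.spansToStrings(personFinder.find(tokens), tokens).toSeq
  personFinder.clearAdaptiveData() // reset the finder between documents

  // Regular nouns via POS tags (NN/NNS), stemmed where applicable.
  val stemmer = new PorterStemmer
  val nouns = tokens.zip(posTagger.tag(tokens)).collect {
    case (token, tag) if tag == "NN" || tag == "NNS" => stemmer.stem(token)
  }

  (names ++ nouns).distinct
}
```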
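And the lookup step, sketched here against the public `wbsearchentities` endpoint on wikidata.org rather than the wmflabs wrapper; the regex "parsing" is a placeholder, and a real version should use a JSON library such as play-json or circe:

```scala
import java.net.URLEncoder

import scala.io.Source
import scala.util.Try

// Returns the first matching entity ID (e.g. "Q42"), or None.
def lookupWikidataId(term: String): Option[String] = {
  val url = "https://www.wikidata.org/w/api.php?action=wbsearchentities" +
    "&format=json&language=en&search=" + URLEncoder.encode(term, "UTF-8")
  Try(Source.fromURL(url).mkString).toOption.flatMap { body =>
    // Crude extraction of the first "id" field; stands in for real JSON parsing.
    """"id":"(Q\d+)"""".r.findFirstMatchIn(body).map(_.group(1))
  }
}
```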
This works, but I feel like I can do better. One obvious improvement would be to cache the relevant pieces of WikiData locally, which I plan on doing. However, before I do that, I want to check if there are other solutions.
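In the meantime, the API calls can at least be memoized in-process so each distinct term hits the endpoint only once. A trivial sketch on top of the `lookupWikidataId` helper above:

```scala
import scala.collection.concurrent.TrieMap

// Thread-safe memo table: term -> WikiData ID (if any).
val idCache = TrieMap.empty[String, Option[String]]

def cachedLookup(term: String): Option[String] =
  idCache.getOrElseUpdate(term, lookupWikidataId(term))
```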
Suggestions?
I tagged the question Scala because I'm using Spark for the task.
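The Spark side is wired up roughly as below. One detail worth noting: the OpenNLP models are not serializable, so they have to be built on the executors, once per partition, rather than on the driver. The HDFS path is a placeholder, and the model files must be available on each executor (e.g. shipped with `--files`):

```scala
import java.io.FileInputStream

import opennlp.tools.tokenize.{TokenizerME, TokenizerModel}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("wikidata-entities").getOrCreate()
val docs  = spark.sparkContext.textFile("hdfs:///path/to/texts") // placeholder path

val tokenized = docs.mapPartitions { partition =>
  // Build the model once per partition instead of closing over a driver-side
  // instance; the POS tagger and name finders follow the same pattern.
  val tokenizer = new TokenizerME(new TokenizerModel(new FileInputStream("en-token.bin")))
  partition.map(doc => tokenizer.tokenize(doc).toSeq)
}
```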