I have a lot of texts (millions), ranging from 100 to 4000 words. The texts are formatted as written work, with punctuation and grammar. Everything is in English.

The problem is simple: how can I extract every Wikidata entity from a given text?

Here, an entity is any noun, proper or common: names of people, organizations and locations, as well as things like chairs, potatoes etc.

So far I've tried the following:

  1. Tokenize the text with OpenNLP, and use the pre-trained models to extract people, locations, organizations and common nouns.
  2. Apply Porter stemming where applicable.
  3. Match all extracted nouns against the wmflabs API to retrieve a potential Wikidata ID.

This works, but I feel like I can do better. One obvious improvement would be to cache the relevant pieces of WikiData locally, which I plan on doing. However, before I do that, I want to check if there are other solutions.
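As a rough illustration of the local-cache idea, the sketch below builds an in-memory index from labels and aliases to candidate IDs and looks extracted nouns up in it. The row shape `(qid, label, aliases)`, the function names and the sample IDs are all assumptions made for the example, not anything taken from a real Wikidata dump.

```python
from collections import defaultdict


def normalise(label: str) -> str:
    """Lower-case and collapse whitespace so lookups ignore trivial differences."""
    return " ".join(label.lower().split())


def build_label_index(rows):
    """Map each normalised label or alias to the set of IDs that use it.

    `rows` is an iterable of (qid, label, aliases) tuples, a hypothetical
    shape for data pre-extracted from a Wikidata dump.
    """
    index = defaultdict(set)
    for qid, label, aliases in rows:
        for name in (label, *aliases):
            index[normalise(name)].add(qid)
    return index


def lookup(index, noun):
    """Return candidate IDs for an extracted noun, sorted for determinism."""
    return sorted(index.get(normalise(noun), ()))


# Toy rows; the IDs are placeholders, not real Wikidata IDs.
SAMPLE_ROWS = [
    ("Q_CHAIR", "chair", ["chairs"]),
    ("Q_POTATO", "potato", ["potatoes", "spud"]),
    ("Q_PARIS_CITY", "Paris", []),
    ("Q_PARIS_MYTH", "Paris", ["Paris of Troy"]),
]

index = build_label_index(SAMPLE_ROWS)
print(lookup(index, "Potatoes"))  # one candidate
print(lookup(index, "paris"))     # two candidates: the label is ambiguous
```

An ambiguous label returns several candidates, so a real pipeline would still need a disambiguation step on top of this. With millions of texts, an index like this could presumably be shipped to the Spark workers as a broadcast variable instead of calling a remote API per noun.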

Suggestions?

I tagged the question Scala because I'm using Spark for the task.

habitats

1 Answer

Some suggestions:

  • try Stanford NER alongside OpenNLP and compare the two on your corpus
  • I doubt stemming adds much value for most entity names
  • I suspect you might be losing information by dividing the task into discrete stages
  • although Wikidata is new, the task isn't, so you might look at papers on entity recognition and disambiguation against Freebase, DBpedia or Wikipedia

In particular, DBpedia Spotlight is one system designed for exactly this task.

  • http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38389.pdf
  • http://ceur-ws.org/Vol-1057/Nebhi_LD4IE2013.pdf

Tom Morris
  • Stemming is actually only applied to nouns identified as common and plural, which are a minority. Thanks for the papers and for pointing out DBpedia Spotlight; I did not know about these. – habitats Feb 04 '16 at 08:37
  • DBpedia is actually linked to Wikidata (for some reason I missed that), so I'll mark your answer as accepted: I was able to use DBpedia Spotlight to fetch the DBpedia ID, and then SPARQL+RDF to fetch the Wikidata IDs directly. – habitats Feb 09 '16 at 19:56
  • @habitats Would you mind going into more detail on how to link them? I am trying this right now, but often there is no direct link from the DBpedia entry to Wikidata – glaserl Apr 12 '21 at 08:57
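
One plausible reading of the DBpedia-to-Wikidata step described in the comments is to follow `owl:sameAs` links from a DBpedia resource to Wikidata entity URIs. The sketch below is an assumption about how that could look, not necessarily what habitats ran: the endpoint URL, the `format` parameter and all function names are illustrative, and the query only finds a match when DBpedia actually carries such a link.

```python
import json
import urllib.parse
import urllib.request

# Assumed public endpoint; swap in a local mirror for bulk use.
DBPEDIA_SPARQL = "https://dbpedia.org/sparql"
WIKIDATA_PREFIX = "http://www.wikidata.org/entity/"


def build_sameas_query(dbpedia_uri: str) -> str:
    """Build a SPARQL query for owl:sameAs targets that are Wikidata entities."""
    return (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        "SELECT ?wd WHERE {\n"
        f"  <{dbpedia_uri}> owl:sameAs ?wd .\n"
        f'  FILTER(STRSTARTS(STR(?wd), "{WIKIDATA_PREFIX}"))\n'
        "}"
    )


def extract_wikidata_ids(sparql_json: dict) -> list:
    """Pull bare IDs out of a SPARQL JSON result, skipping non-Wikidata URIs."""
    ids = []
    for binding in sparql_json.get("results", {}).get("bindings", []):
        uri = binding.get("wd", {}).get("value", "")
        if uri.startswith(WIKIDATA_PREFIX):
            ids.append(uri[len(WIKIDATA_PREFIX):])
    return ids


def fetch_wikidata_ids(dbpedia_uri: str, endpoint: str = DBPEDIA_SPARQL) -> list:
    """Run the query over HTTP; returns an empty list when no link exists."""
    params = urllib.parse.urlencode({
        "query": build_sameas_query(dbpedia_uri),
        # Assumed to be accepted by the endpoint as a result-format hint.
        "format": "application/sparql-results+json",
    })
    with urllib.request.urlopen(f"{endpoint}?{params}", timeout=30) as resp:
        return extract_wikidata_ids(json.load(resp))
```

Calling `fetch_wikidata_ids` with the resource URI returned by DBpedia Spotlight would yield zero or more IDs. An empty list corresponds to the case glaserl describes, where the DBpedia entry has no direct link to Wikidata, so a fallback such as a label-based lookup would still be needed.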