
I have tried my hand at many NER tools (OpenNLP, Stanford NER, LingPipe, DBpedia Spotlight, etc.).

But what has constantly evaded me is a gazetteer/dictionary-based NER system, where my free text is matched against a list of pre-defined entity names and potential matches are returned.

This way I could have various lists like PERSON, ORGANIZATION, etc. I could change the lists dynamically and get different extractions. It would also tremendously decrease setup time, since most existing tools are based on maximum-entropy models, which generally require tagging a large dataset, training the model, and so on.

I have built a very crude gazetteer-based NER system using an OpenNLP POS tagger: I extract all proper nouns (NNP) and look them up in a Lucene index created from my lists. This, however, gives me a lot of false positives. For example, if my Lucene index contains "Samsung Electronics" and my POS tagger tags "Electronics" as a proper noun, my approach returns "Samsung Electronics", since I am doing partial matches.
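For what it's worth, the partial-match problem can be avoided by requiring the *full* entity name to appear in the text, matching the longest span first. Here is a minimal sketch (the entity names and the whitespace tokenization are just for illustration) of such an exact, longest-match gazetteer lookup:

```python
# Toy gazetteer: full entity names (as token tuples) mapped to their type.
GAZETTEER = {
    ("Samsung", "Electronics"): "ORGANIZATION",
    ("Samsung",): "ORGANIZATION",
}
MAX_LEN = max(len(name) for name in GAZETTEER)

def find_entities(text):
    """Return (surface form, type) pairs for exact gazetteer matches only."""
    tokens = text.split()  # naive tokenization, for illustration
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest span first, so "Samsung Electronics" wins over "Samsung".
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in GAZETTEER:
                matches.append((" ".join(span), GAZETTEER[span]))
                i += n  # skip past the matched span
                break
        else:
            i += 1  # no match starting here, move on
    return matches
```

Because a span is returned only when the whole name is present, "Electronics" on its own never pulls in "Samsung Electronics". A real system would also need case normalization and a proper tokenizer, but the longest-match-first idea is the key part.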

I have also read people talking about using gazetteer as a feature in CRF algorithms. But I never could understand this approach.

Can any of you guide me towards a clear and solid approach that builds NER on gazetteer and dictionaries?

Vini

2 Answers


I'll try to make the use of gazetteers clearer, as I suspect this is what you are looking for. Whatever training algorithm is used (CRF, maxent, etc.), it takes into account features, which are most of the time:

  • tokens
  • part of speech
  • capitalization
  • gazetteers
  • (and much more)

Gazetteer features provide the model with intermediate information that the training step takes into account, without making it explicitly dependent on the list of NEs present in the training corpus. Say you have a gazetteer of sport teams: once the model is trained, you can expand the list as much as you want without retraining. The model will consider any listed sport team as... a sport team, whatever its name.
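Concretely, "gazetteer as a feature" usually means each token gets a flag saying whether it starts or continues a gazetteer entry, and that flag is fed to the tagger alongside the usual features. A rough sketch, with made-up helper names and a toy gazetteer:

```python
# Toy gazetteer: each entry is a tuple of tokens.
SPORT_TEAMS = {("Real", "Madrid"), ("Boca", "Juniors")}

def gazetteer_flags(tokens, gazetteer):
    """Mark tokens that begin (B-GAZ) or continue (I-GAZ) a gazetteer entry."""
    flags = ["O"] * len(tokens)
    for i in range(len(tokens)):
        for entry in gazetteer:
            n = len(entry)
            if tuple(tokens[i:i + n]) == entry:
                flags[i] = "B-GAZ"
                for j in range(i + 1, i + n):
                    flags[j] = "I-GAZ"
    return flags

def token_features(tokens, i, gaz_flags):
    """Typical per-token feature dict for a CRF: identity, shape, gazetteer flag."""
    return {
        "token": tokens[i].lower(),
        "is_capitalized": tokens[i][0].isupper(),
        "gazetteer": gaz_flags[i],
    }
```

The trained model learns how predictive the `gazetteer` flag is in combination with the other features; the flag, not the specific name, is what the weights attach to, which is why the list can grow afterwards.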

In practice:

  1. Use any NER or ML-based framework
  2. Decide what gazetteers are useful (this is maybe the most crucial part)
  3. Assign each gazetteer a relevant tag (e.g. sportteams, companies, cities, monuments, etc.)
  4. Populate gazetteers with large lists of NEs
  5. Make your model take into account those gazetteers as features
  6. Train a model on a relevant corpus (it should contain many NEs from the gazetteers)
  7. Update your list as much as you want
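To illustrate step 7 with a toy example (hypothetical names throughout): since the model only ever sees the "in-gazetteer" flag, extending the list changes the features computed at prediction time without retraining anything.

```python
# Mutable gazetteer of team names as token tuples.
gazetteer = {("Real", "Madrid")}

def in_gazetteer(tokens, i):
    """True if any gazetteer entry starts at position i."""
    return any(tuple(tokens[i:i + len(entry)]) == entry for entry in gazetteer)

tokens = "Golden State Warriors won".split()
before = in_gazetteer(tokens, 0)   # not listed yet

gazetteer.add(("Golden", "State", "Warriors"))  # just update the list
after = in_gazetteer(tokens, 0)    # same code, new feature value
```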

Hope this helps!

eldams

You can try MER, a minimal Named-Entity Recognizer written in bash: https://github.com/lasigeBioTM/MER. Demo: http://labs.fc.ul.pt/mer/

FCouto