
I have tried my hand at many NER tools (OpenNLP, Stanford NER, LingPipe, DBpedia Spotlight, etc.).

But what has constantly evaded me is a gazetteer/dictionary-based NER system, where my free text is matched against a list of pre-defined entity names and potential matches are returned.

This way I could have various lists like PERSON, ORGANIZATION, etc. I could change the lists dynamically and get different extractions. It would also tremendously decrease setup time, since most existing tools are based on maximum-entropy models, which generally require tagging a large dataset, training the model, and so on.

I have built a very crude gazetteer-based NER system using an OpenNLP POS tagger: I extract all proper nouns (NNP) and look them up in a Lucene index created from my lists. This, however, gives me a lot of false positives. For example, if my Lucene index contains "Samsung Electronics" and my POS tagger tags "Electronics" as a proper noun, my approach returns "Samsung Electronics", since I am doing partial matches.
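For what it's worth, the partial-match problem can be avoided by requiring the *full* entity name to appear in the text, matching the longest span first. Here is a minimal sketch (the entity names and the whitespace tokenization are just for illustration) of such an exact, longest-match gazetteer lookup:

```python
# Toy gazetteer: full entity names (as token tuples) mapped to their type.
GAZETTEER = {
    ("Samsung", "Electronics"): "ORGANIZATION",
    ("Samsung",): "ORGANIZATION",
}
MAX_LEN = max(len(name) for name in GAZETTEER)

def find_entities(text):
    """Return (surface form, type) pairs for exact gazetteer matches only."""
    tokens = text.split()  # naive tokenization, for illustration
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest span first, so "Samsung Electronics" wins over "Samsung".
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in GAZETTEER:
                matches.append((" ".join(span), GAZETTEER[span]))
                i += n  # skip past the matched span
                break
        else:
            i += 1  # no match starting here, move on
    return matches
```

Because a span is returned only when the whole name is present, "Electronics" on its own never pulls in "Samsung Electronics". A real system would also need case normalization and a proper tokenizer, but the longest-match-first idea is the key part.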

I have also read people talking about using gazetteer as a feature in CRF algorithms. But I never could understand this approach.

Can any of you guide me towards a clear and solid approach that builds NER on gazetteer and dictionaries?

Vini

2 Answers


I'll try to make the use of gazetteers clearer, as I suspect this is what you are looking for. Whatever training algorithm is used (CRF, maxent, etc.), it takes into account features, which are most of the time:

  • tokens
  • part of speech
  • capitalization
  • gazetteers
  • (and much more)

Gazetteer features provide the model with intermediate information that the training step takes into account, without making it explicitly dependent on the list of NEs present in the training corpus. Say you have a gazetteer of sport teams: once the model is trained, you can expand the list as much as you want without retraining. The model will consider any listed sport team as... a sport team, whatever its name.
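Concretely, "gazetteer as a feature" usually means each token gets a flag saying whether it starts or continues a gazetteer entry, and that flag is fed to the tagger alongside the usual features. A rough sketch, with made-up helper names and a toy gazetteer:

```python
# Toy gazetteer: each entry is a tuple of tokens.
SPORT_TEAMS = {("Real", "Madrid"), ("Boca", "Juniors")}

def gazetteer_flags(tokens, gazetteer):
    """Mark tokens that begin (B-GAZ) or continue (I-GAZ) a gazetteer entry."""
    flags = ["O"] * len(tokens)
    for i in range(len(tokens)):
        for entry in gazetteer:
            n = len(entry)
            if tuple(tokens[i:i + n]) == entry:
                flags[i] = "B-GAZ"
                for j in range(i + 1, i + n):
                    flags[j] = "I-GAZ"
    return flags

def token_features(tokens, i, gaz_flags):
    """Typical per-token feature dict for a CRF: identity, shape, gazetteer flag."""
    return {
        "token": tokens[i].lower(),
        "is_capitalized": tokens[i][0].isupper(),
        "gazetteer": gaz_flags[i],
    }
```

The trained model learns how predictive the `gazetteer` flag is in combination with the other features; the flag, not the specific name, is what the weights attach to, which is why the list can grow afterwards.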

In practice:

  1. Use any NER or ML-based framework
  2. Decide what gazetteers are useful (this is maybe the most crucial part)
  3. Assign each gazetteer a relevant tag (e.g. sportteams, companies, cities, monuments, etc.)
  4. Populate gazetteers with large lists of NEs
  5. Make your model take into account those gazetteers as features
  6. Train a model on a relevant corpus (it should contain many NEs from the gazetteers)
  7. Update your list as much as you want
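To illustrate step 7 with a toy example (hypothetical names throughout): since the model only ever sees the "in-gazetteer" flag, extending the list changes the features computed at prediction time without retraining anything.

```python
# Mutable gazetteer of team names as token tuples.
gazetteer = {("Real", "Madrid")}

def in_gazetteer(tokens, i):
    """True if any gazetteer entry starts at position i."""
    return any(tuple(tokens[i:i + len(entry)]) == entry for entry in gazetteer)

tokens = "Golden State Warriors won".split()
before = in_gazetteer(tokens, 0)   # not listed yet

gazetteer.add(("Golden", "State", "Warriors"))  # just update the list
after = in_gazetteer(tokens, 0)    # same code, new feature value
```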

Hope this helps!

eldams

You can try MER, a minimal Named-Entity Recognizer written in bash: https://github.com/lasigeBioTM/MER. Demo: http://labs.fc.ul.pt/mer/

FCouto