I have tried my hands on many NER tools (OpenNLP, Stanford NER, LingPipe, Dbpedia Spotlight etc).
But what has constantly evaded me is a gazetteer/dictionary based NER system where my free text is matched with a list of pre-defined entity names, and potential matches are returned.
This way I could have various lists like PERSON, ORGANIZATION etc. I could dynamically change the lists and get different extractions. This would tremendously decrease training time (since most of them are based on maximum entropy model so they generally includes tagging a large dataset, training the model etc).
I have built a very crude gazetteer based NER system using a OpenNLP POS tagger, from which I used to take out all the Proper nouns (NP) and then look them up in a Lucene index created from my lists. This however gives me a lot of false positives. For ex. if my lucene index has "Samsung Electronics" and my POS tagger gives me "Electronics" as a NP, my approach would return me "Samsung Electronics" since I am doing partial matches.
I have also read people talking about using gazetteer as a feature in CRF algorithms. But I never could understand this approach.
Can any of you guide me towards a clear and solid approach that builds NER on gazetteer and dictionaries?