How to use gazetteers or dictionaries as features in CRF++?

To elaborate: suppose I want to do NER on person names, and I have a gazetteer (or dictionary) containing commonly seen person names. I want to use this gazetteer as an input to CRF++. How can I do that?

I am using the conditional random field package CRF++ to perform named entity recognition. I know how to represent some commonly used features in CRF++. For example, to use capitalization as a feature, we can add a separate column to the data file indicating whether each word is capitalized, and refer to that column in the feature template.

DehengYe

1 Answer

You could make a new feature that indicates whether a token is in the dictionary/gazetteer. Just check for set membership and set the gazetteer feature to 1 or 0.
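As a rough illustration of the set-membership idea, here is a minimal Python sketch that builds whitespace-separated rows in the column layout CRF++ reads (one token per line). The function name `crfpp_rows`, the toy gazetteer, and the `B-PER`/`O` labels are all hypothetical and only there for the example; the lookup is case-sensitive, which a real setup may not want.

```python
def crfpp_rows(tokens, gazetteer, labels=None):
    """Build rows of the form: token, capitalized flag, gazetteer flag, [label].

    `labels` is optional so the same function can prepare both training data
    (with labels) and data to be tagged (without labels).
    """
    rows = []
    for i, token in enumerate(tokens):
        columns = [
            token,
            "1" if token[:1].isupper() else "0",   # capitalization feature
            "1" if token in gazetteer else "0",    # case-sensitive set membership
        ]
        if labels is not None:
            columns.append(labels[i])
        rows.append("\t".join(columns))
    return rows


# Hypothetical toy gazetteer and sentences, for illustration only.
gazetteer = {"john", "mary"}

# Training rows carry the gold label in the last column.
train_rows = crfpp_rows(["John", "loves", "mary"], gazetteer, ["B-PER", "O", "B-PER"])

# Rows to be tagged are built by the very same feature extraction.
tag_rows = crfpp_rows(["jack", "loves", "john"], gazetteer)

print("\n".join(train_rows))
```

The gazetteer itself is never handed to CRF++; only the resulting 0/1 column is. In the feature template, that column would then be referenced by its index, for example with a macro along the lines of `U10:%x[0,2]` (the `U10` identifier is arbitrary).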

HugoMailhot
  • You mean during training? – DehengYe Oct 19 '15 at 03:46
  • May I know more details? – DehengYe Oct 19 '15 at 03:46
  • The feature extraction has to be the same during training and tagging, otherwise you are not feeding your model what it expects. The same way you add a separate column to indicate whether a word is capitalized, you could add another column to say whether a given word is present in your gazetteer/dictionary. Let's assume a gazetteer containing only 'john' and 'mary'. Using the two features (Capitalized, InGazetteer) with the sequence "John loves mary" and case-sensitive matching, you would get (1,0), (0,0), (0,1). Of course, a real model would use a wider variety of features. – HugoMailhot Oct 19 '15 at 07:09
  • Thank you. I know we can have many other features, but let's focus on the gazetteer feature. Continuing your example, let's assume a gazetteer containing only 'john', 'mary' and 'jack', and that I am using one column indicating whether the current token is in my gazetteer. Using the gazetteer feature to train on the sequence 'john loves mary', I will get 1, 0, 1. This is for training. If I want to tag the sequence 'jack loves john', where jack is in my gazetteer, how do I let CRF++ know about my gazetteer during tagging? – DehengYe Oct 20 '15 at 13:07
  • Thank you for your help. I managed to insert gazetteers as features using simple string matching. But my gazetteer contains many common English words, which does not seem to help CRF tagging performance. Would it be possible to have a talk about the performance of using gazetteers as features? – DehengYe Oct 28 '15 at 02:28