
I am trying to train an NER model for an Indian language, using a custom named-entity (NE) dictionary for chunking. I have looked at NLTK and Stanford NER respectively:

  1. NLTK

I found that nltk.chunk.named_entity.NEChunkParser is able to train on a custom corpus. However, the format of the training corpus is not specified in the documentation or in the comments of the source code.

Where could I find a guide to building a custom corpus for NER in NLTK?

  2. Stanford NER

According to this question, the Stanford NER FAQ gives directions on how to train a custom NER model.

One of my major concerns is that the default Stanford NER does not support Indian languages. So is it viable to feed an Indian-language NER corpus to the model?

  • The Stanford NER can be trained on any language as long as the training corpus complies with the specified format. Besides, NLTK provides a nice (though somewhat buggy) interface to use the trained Stanford NER tagger. – Zelong Jan 14 '16 at 10:29

1 Answer


Your training corpus needs to be a .tsv file.

The file should look something like this:

John PER
works O
at O
Intel ORG

This is just a representation of the data, as I do not know which Indian language you are targeting. But your data must always be tab-separated values: the first column is the token and the second is its associated label.
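If your labeled data lives in Python, writing it out in this format is straightforward. A minimal sketch (the sentences and labels below are made-up placeholders; with an Indian language you would substitute your own tokenized, labeled text, and the blank line between sentences is a common convention for marking sentence boundaries):

```python
# Sketch: build a Stanford-NER-style TSV training file from labeled tokens.
sentences = [
    [("John", "PER"), ("works", "O"), ("at", "O"), ("Intel", "ORG")],
    [("Ravi", "PER"), ("lives", "O"), ("in", "O"), ("Delhi", "LOC")],
]

def to_tsv(sentences):
    """One 'token<TAB>label' line per token; a blank line between sentences."""
    blocks = []
    for sent in sentences:
        blocks.append("\n".join(f"{tok}\t{lab}" for tok, lab in sent))
    return "\n\n".join(blocks) + "\n"

# Write UTF-8 so non-Latin scripts survive round-tripping.
with open("train.tsv", "w", encoding="utf-8") as f:
    f.write(to_tsv(sentences))
```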

I have tried NER by building my own custom data (in English, though) and have trained a model.

So I guess it is very much possible for Indian languages as well.
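Once the TSV file is ready, Stanford NER training is driven by a properties file passed to CRFClassifier. A minimal sketch following the setup suggested in the Stanford NER FAQ (the file names train.tsv and my-ner-model.ser.gz are placeholders; the feature flags are a reasonable starting set, not the only valid ones):

```
# ner.prop — minimal training configuration for CRFClassifier
trainFile = train.tsv
serializeTo = my-ner-model.ser.gz
# column 0 is the token, column 1 is the gold label
map = word=0,answer=1

useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
useDisjunctive = true
```

Training is then a single command against the Stanford NER jar:

```
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop ner.prop
```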

Rohan Amrute