50

I'm trying to extract named entities from my text using NLTK. I find that NLTK's NER is not very accurate for my purposes, and I also want to add some tags of my own. I've been trying to find a way to train my own NER, but I can't seem to find the right resources. I have a couple of questions regarding NLTK:

  1. Can I use my own data to train a Named Entity Recognizer in NLTK?
  2. If I can train using my own data, is named_entity.py the file to be modified?
  3. Does the input file format have to be in IOB, e.g. Eric NNP B-PERSON?
  4. Are there any resources - apart from the NLTK cookbook and NLP with Python - that I can use?

I would really appreciate help in this regard.

user1502248

6 Answers

24

Are you committed to using NLTK/Python? I ran into the same problems as you, and had much better results using Stanford's named-entity recognizer: http://nlp.stanford.edu/software/CRF-NER.shtml. The process for training the classifier using your own data is very well-documented in the FAQ.

If you really need to use NLTK, I'd hit up the mailing list for some advice from other users: http://groups.google.com/group/nltk-users.

Hope this helps!

jjdubs
  • Browsing through the SNER site, I saw that there's even a Python interface [here](https://github.com/dat/pyner). Not sure how mature it is, but it might be helpful. – senderle Jul 09 '12 at 20:05
  • I had the same problem and shared what worked for me. Sorry if that upset you bro :( – jjdubs Sep 04 '12 at 22:13
  • The Stanford NER has been included in NLTK 2.0. Read more: http://www.nltk.org/api/nltk.tag.html#module-nltk.tag.stanford – Jayesh Feb 16 '14 at 11:53
  • Guys, here I wrote a script to download and prepare everything required to get Python, NLTK and Stanford NER working together: https://gist.github.com/troyane/c9355a3103ea08679baf – NG_ Jun 09 '14 at 10:51
  • Does anyone know how to use the Python-Stanford NER interface to train on new corpora? – user3314418 Aug 05 '14 at 15:13
  • @blueblank There's really no point in using NLTK anyway if you're trying to use it for production purposes. Stanford CoreNLP is already much better as a toolkit with traditional ML approaches, not to mention you have all the deep learning stuff already used in production nowadays. – xji Jul 24 '18 at 18:28
14

You can easily use the Stanford NER along with NLTK. A minimal Python script looks like this:

from nltk.tag.stanford import NERTagger  # renamed StanfordNERTagger in NLTK 3.x
import os

# Tell NLTK where to find Java
java_path = "/Java/jdk1.8.0_45/bin/java.exe"
os.environ['JAVAHOME'] = java_path

# Load your trained model and the Stanford NER jar
st = NERTagger('../ner-model.ser.gz', '../stanford-ner.jar')

# `text` is the string you want to tag
tagging = st.tag(text.split())

To train on your own data and create a model, refer to the first question of the Stanford NER FAQ: http://nlp.stanford.edu/software/crf-faq.shtml
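The FAQ's recipe amounts to writing a small properties file and running the CRF classifier on it. A sketch, where the file names are placeholders for your own setup and the feature flags follow the FAQ's example:

```
# train.prop -- train with:
#   java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop train.prop

# training data: one "token<TAB>label" pair per line, blank line between sentences
trainFile = my-training-data.tsv
# where the trained model is written
serializeTo = ner-model.ser.gz
# column layout of the training file
map = word=0,answer=1

# a reasonable starting feature set, taken from the FAQ
useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
wordShape = chris2useLC
```

The resulting ner-model.ser.gz is what you then pass to the tagger from Python.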

Rohan Amrute
1

I also had this issue, but I managed to work it out. You can use your own training data. I documented the main requirements/steps for this in my GitHub repository.

I used NLTK-trainer, so basically you have to get the training data into the right format (token NNP B-tag) and run the training script. Check my repository for more info.
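For illustration, a file in that format has one token per line with its POS tag and IOB label, and a blank line between sentences (these tags are made-up examples, not taken from the linked repository):

```
Eric NNP B-PERSON
lives VBZ O
in IN O
New NNP B-LOCATION
York NNP I-LOCATION
. . O
```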

arop
1

There are some functions in the nltk.chunk.named_entity module that train a NER tagger. However, they were written specifically for the ACE corpus and are not entirely cleaned up, so you will need to write your own training procedure with those as a reference.

There are also two relatively recent guides (1 2) online detailing the process of using NLTK to train on the GMB corpus.

However, as mentioned in the answers above, now that many tools are available, you really shouldn't need to resort to NLTK if a streamlined training process is what you want. Toolkits such as CoreNLP and spaCy do a much better job. Since using NLTK is not that different from writing your own training code from scratch, there is not much value in doing so. NLTK and OpenNLP can be regarded as belonging to a past era, before the recent explosion of progress in NLP.

xji
0
  1. Are there any resources - apart from the NLTK cookbook and NLP with Python - that I can use?

You can consider using spaCy to train your own custom data for the NER task. Here is an example, adapted from this thread, of training a model on a custom training set to detect a new entity type, ANIMAL. The code has been fixed and updated for easier reading.

import random
import spacy
from spacy.training import Example

LABEL = 'ANIMAL'
TRAIN_DATA = [
    ("Horses are too tall and they pretend to care about your feelings", {'entities': [(0, 6, LABEL)]}),
    ("Do they bite?", {'entities': []}),
    ("horses are too tall and they pretend to care about your feelings", {'entities': [(0, 6, LABEL)]}),
    ("horses pretend to care about your feelings", {'entities': [(0, 6, LABEL)]}),
    ("they pretend to care about your feelings, those horses", {'entities': [(48, 54, LABEL)]}),
    ("horses?", {'entities': [(0, 6, LABEL)]})
]
nlp = spacy.load('en_core_web_sm')  # load existing spaCy model
ner = nlp.get_pipe('ner')
ner.add_label(LABEL)

optimizer = nlp.create_optimizer()

# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):  # only train NER
    for itn in range(20):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
            nlp.update([example], drop=0.35, sgd=optimizer, losses=losses)
        print(losses)

# test the trained model
test_text = 'Do you like horses?'
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
    print(ent.label_, " -- ", ent.text)

Here is the output:

{'ner': 9.60289144264557}
{'ner': 8.875474230820478}
{'ner': 6.370401408220459}
{'ner': 6.687456469517201}
... 
{'ner': 1.3796682589133492e-05}
{'ner': 1.7709562613218738e-05}

Entities in 'Do you like horses?'
ANIMAL  --  horses
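After training, the updated pipeline can be persisted and reloaded like any packaged model. A minimal sketch: the directory name animal_ner_model is arbitrary, and a blank pipeline stands in for the trained nlp so the snippet runs on its own.

```python
import spacy

# A blank English pipeline stands in here for the `nlp` trained above,
# so this snippet is self-contained.
nlp = spacy.blank('en')

# Persist the whole pipeline (config, vocab, weights) to a directory
nlp.to_disk('animal_ner_model')

# Reload it later exactly like a packaged model
nlp_reloaded = spacy.load('animal_ner_model')
doc = nlp_reloaded('Do you like horses?')
```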
Thang Pham
  • Would you mind stating what version of spaCy was used for this? The method add_label throws an error in my case. – Daniel Oct 19 '21 at 14:59
0

To complete the answer by @Thang M. Pham, you need to label your data before training. To do so, you can use spacy-annotator.

Here is an example taken from another answer: Train Spacy NER on Indian Names

iEriii