
I'm using NLTK as an interface to the Stanford NER Tagger. Is there any option to get the NER result in IOB format using NLTK? I've read this question, but it's for Java users.

NLTK version: 3.4

Java version: jdk1.8.0_211/bin

Stanford NER model: english.conll.4class.distsim.crf.ser.gz

Input: My name is Donald Trumph

Expected output: My/O name/O is/O Donald/B-PERSON Trumph/I-PERSON

alvas
MaybeNextTime
  • See https://stackoverflow.com/a/51981566/610569 and https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK – alvas May 24 '19 at 22:14
  • The main thing is: how can I get the output in IOB format? – MaybeNextTime May 25 '19 at 01:50
  • See the NER Tagger portion of the answer... Read the answer `ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')`, https://stackoverflow.com/questions/13883277/stanford-parser-and-nltk/51981566#51981566 – alvas May 25 '19 at 04:08
  • I've already seen it, but it only returns the same PERSON tag for every token of a person's name. I want it in IOB format: Donald Trumph -> Donald/B-PERSON Trumph/I-PERSON – MaybeNextTime May 25 '19 at 06:51
  • Welcome to SO. Next time try to put some effort in explaining what you have tried. Otherwise people will not be responsive here on SO =) – alvas May 25 '19 at 11:31

1 Answer


TL;DR

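First, see the "Stanford Parser and NLTK" answer (https://stackoverflow.com/questions/13883277/stanford-parser-and-nltk/51981566#51981566) for how to get NER tags through NLTK.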
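For reference, here is a minimal sketch of how the (word, tag) pairs might be obtained, based on the `CoreNLPParser(url='http://localhost:9000', tagtype='ner')` call from the linked answer. It assumes a CoreNLP server is already running at that URL; the helper name `stanford_ner_tags` is hypothetical, and the server's default NER models may use different labels than `english.conll.4class.distsim.crf.ser.gz`.

```python
def stanford_ner_tags(tokens, url="http://localhost:9000"):
    """Return [(word, tag), ...] for a list of tokens.

    Assumes a Stanford CoreNLP server is listening at `url`.
    """
    # Imported inside the function so this file loads even without NLTK.
    from nltk.parse.corenlp import CoreNLPParser

    ner_tagger = CoreNLPParser(url=url, tagtype="ner")
    return list(ner_tagger.tag(tokens))


# Usage (needs the server to be up):
#   stanford_ner_tags("My name is Donald Trumph".split())
```

The tags that come back are plain entity types such as PERSON, with no B-/I- prefix, which is why the conversion step is needed.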

Write a simple loop and iterate through the NER outputs:

def stanford_to_bio(tagged_sent):
    """Convert Stanford NER (word, tag) pairs to BIO-tagged pairs."""
    prev_tag = "O"  # raw (unprefixed) tag of the previous token
    bio_tagged_output = []
    for word, tag in tagged_sent:
        if tag == "O":
            # Outside any entity: keep the tag unchanged.
            bio_tagged_output.append((word, tag))
        elif tag != prev_tag:
            # First token of an entity, or the entity type changed.
            bio_tagged_output.append((word, "B-" + tag))
        else:
            # Same type as the previous token: continue the entity.
            bio_tagged_output.append((word, "I-" + tag))
        prev_tag = tag
    return bio_tagged_output

tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'STATE_OR_PROVINCE')]
stanford_to_bio(tagged_sent)

[out]:

[('Rami', 'B-PERSON'),
 ('Eid', 'I-PERSON'),
 ('is', 'O'),
 ('studying', 'O'),
 ('at', 'O'),
 ('Stony', 'B-ORGANIZATION'),
 ('Brook', 'I-ORGANIZATION'),
 ('University', 'I-ORGANIZATION'),
 ('in', 'O'),
 ('NY', 'B-STATE_OR_PROVINCE')]
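To get the exact `word/TAG` string from the question, the BIO pairs can be joined with a slash. The sketch below is self-contained: `to_bio` is a compact form of the same loop that treats any change of entity type as the start of a new entity, `to_slash_string` is a hypothetical helper name, and the input tags are written by hand here as an assumption about what the tagger would return for that sentence.

```python
def to_bio(tagged_sent):
    """Turn [(word, TYPE), ...] into [(word, 'B-TYPE' / 'I-TYPE' / 'O'), ...]."""
    bio = []
    prev = "O"  # raw tag of the previous token
    for word, tag in tagged_sent:
        if tag == "O":
            bio.append((word, "O"))
        elif tag == prev:
            bio.append((word, "I-" + tag))  # same entity continues
        else:
            bio.append((word, "B-" + tag))  # new entity starts
        prev = tag
    return bio


def to_slash_string(bio_tagged):
    """Format [(word, tag), ...] as 'word/tag word/tag ...'."""
    return " ".join(f"{word}/{tag}" for word, tag in bio_tagged)


# Hand-written tags, assumed to match the tagger's output for this sentence.
tagged = [("My", "O"), ("name", "O"), ("is", "O"),
          ("Donald", "PERSON"), ("Trumph", "PERSON")]
print(to_slash_string(to_bio(tagged)))
# My/O name/O is/O Donald/B-PERSON Trumph/I-PERSON
```

Note that two adjacent entities of the same type cannot be told apart from the raw tags alone, so they will be merged into one span.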
alvas