I need data in the following format:

(u'Melbourne', u'NP', u'B-LOC'),
 (u'(', u'Fpa', u'O'),
 (u'Australia', u'NP', u'B-LOC'),
 (u')', u'Fpt', u'O'),
 (u',', u'Fc', u'O'),

What I have is just a txt file. I need this data to train a CRF model for an NER task. I'm planning to use CRFsuite for Python, but I can't quite understand how to label the training data. I can POS-tag it, but how do I add the named entities? I need to label the training data with 2 custom labels.

Rohan Amrute

3 Answers

If you want to train a CRF model, you need annotated data. For some tasks you can rely on existing corpora, but if your task is new you'll have to annotate the entities yourself. There are tools that can help, e.g. take a look at http://brat.nlplab.org/. GATE also has a built-in annotation tool.

POS tags are often used as features, but they are not strictly required (and you should use many other features as well).
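As a rough illustration of using POS tags alongside other features, here is a minimal sketch that turns (token, POS, label) triples like the ones in the question into per-token feature dicts. The function names and the particular feature set are assumptions chosen for illustration, not a recommended or complete set.

```python
def token_features(sent, i):
    """Build a feature dict for the i-th token of a sentence.

    `sent` is a list of (token, pos_tag, label) triples, as in the question.
    The feature names here are illustrative, not a fixed convention.
    """
    word, pos, _ = sent[i]
    feats = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "pos": pos,  # the POS tag is just one feature among several
    }
    if i > 0:
        prev_word, prev_pos, _ = sent[i - 1]
        feats["-1:word.lower"] = prev_word.lower()
        feats["-1:pos"] = prev_pos
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        next_word, next_pos, _ = sent[i + 1]
        feats["+1:word.lower"] = next_word.lower()
        feats["+1:pos"] = next_pos
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sent_to_features(sent):
    return [token_features(sent, i) for i in range(len(sent))]


def sent_to_labels(sent):
    return [label for _, _, label in sent]


sent = [
    ("Melbourne", "NP", "B-LOC"),
    ("(", "Fpa", "O"),
    ("Australia", "NP", "B-LOC"),
    (")", "Fpt", "O"),
    (",", "Fc", "O"),
]
X = sent_to_features(sent)
y = sent_to_labels(sent)
```

One feature sequence and one label sequence per sentence is the shape that CRFsuite wrappers generally expect; with python-crfsuite, for example, each pair would be passed to `Trainer.append(xseq, yseq)`.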

Mikhail Korobov

If you want to create your own training data with entity types other than just Location or Person, you can refer to my answer to the question "Is it possible to train Stanford NER system to recognize more named entities types?"

Community
Rohan Amrute

Brat is an excellent way to annotate your new dataset. After annotating, you need to convert the standoff format that Brat outputs into the format that Stanford NER accepts.
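The conversion could look roughly like the sketch below. It assumes the annotations are simple contiguous text-bound entries (the `T` lines of a `.ann` file, in the form `ID<TAB>TYPE START END<TAB>TEXT`) and uses a naive regex tokenizer; the function names, the sample text, and the `LOC` label are all made up for illustration. The output is one token and one label per line, separated by a tab.

```python
import re


def parse_ann(ann_text):
    """Return sorted (start, end, label) tuples for text-bound annotations."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, attributes, notes
        _id, info, _text = line.split("\t", 2)
        if ";" in info:
            continue  # discontinuous spans are not handled in this sketch
        label, start, end = info.split(" ")
        spans.append((int(start), int(end), label))
    return sorted(spans)


def to_bio(text, spans):
    """Tokenize `text` naively and assign a BIO tag to each token."""
    rows = []
    for m in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            if start <= m.start() and m.end() <= end:
                prefix = "B-" if m.start() == start else "I-"
                tag = prefix + label
                break
        rows.append((m.group(), tag))
    return rows


# Hypothetical document text and the matching Brat annotations.
text = "Melbourne (Australia), New York"
ann = (
    "T1\tLOC 0 9\tMelbourne\n"
    "T2\tLOC 11 20\tAustralia\n"
    "T3\tLOC 23 31\tNew York\n"
)

rows = to_bio(text, parse_ann(ann))
tsv = "\n".join(f"{tok}\t{tag}" for tok, tag in rows)
print(tsv)
```

The `B-`/`I-` prefixes mirror the labels in the question; if your training setup expects bare labels instead, drop the prefix. A real pipeline would also want the same tokenizer at annotation, training, and prediction time so that the offsets line up.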

Abhimanyu