How to prepare training corpus for CRF model using CRFSuite

Question

I need data in the following format:

(u'Melbourne', u'NP', u'B-LOC'),
 (u'(', u'Fpa', u'O'),
 (u'Australia', u'NP', u'B-LOC'),
 (u')', u'Fpt', u'O'),
 (u',', u'Fc', u'O'),

All I have is a plain txt file. I need this data to train a CRF model for an NER task. I'm planning to use CRFSuite for Python, but I can't quite understand how to label the training data. I can POS-tag it, but how do I add the named entities? I need to label the training data with 2 custom labels.

score 4 · Answer 1 · answered Dec 05 '16 at 13:32

If you want to train a CRF model, you need annotated data. For some tasks you can rely on existing corpora, but if your task is new you'll have to annotate the entities yourself. There are tools that can help, e.g. take a look at http://brat.nlplab.org/. GATE also has a built-in annotation tool.

POS tags are often used as features, but they are not strictly required (and you should use many other features as well).
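To illustrate what "POS tags as features" could look like, here is a minimal sketch that turns one sentence in the (token, POS tag, label) layout from the question into per-token feature dicts plus a label sequence. The function names (`word2features`, `sent2features`, `sent2labels`) and the particular feature set are assumptions chosen for illustration, not a recommended or complete set.

```python
# Sentence in the (token, POS tag, NER label) layout from the question.
sent = [
    ("Melbourne", "NP", "B-LOC"),
    ("(", "Fpa", "O"),
    ("Australia", "NP", "B-LOC"),
    (")", "Fpt", "O"),
    (",", "Fc", "O"),
]


def word2features(sent, i):
    """Build a feature dict for token i (illustrative feature set only)."""
    token, postag = sent[i][0], sent[i][1]
    features = {
        "bias": 1.0,
        "word.lower": token.lower(),
        "word.istitle": token.istitle(),
        "word.isdigit": token.isdigit(),
        "suffix3": token[-3:],
        "postag": postag,  # the POS tag is just one feature among several
    }
    if i > 0:
        # Context from the previous token
        features["-1:word.lower"] = sent[i - 1][0].lower()
        features["-1:postag"] = sent[i - 1][1]
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        # Context from the next token
        features["+1:word.lower"] = sent[i + 1][0].lower()
        features["+1:postag"] = sent[i + 1][1]
    else:
        features["EOS"] = True  # end of sentence
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    return [label for _token, _postag, label in sent]


X = sent2features(sent)  # one feature dict per token
y = sent2labels(sent)    # one label per token
```

Per-sentence pairs like `X` and `y` are the kind of input a CRFSuite wrapper for Python expects when you add training sequences; check the documentation of the wrapper you use for the exact call.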

answered Dec 05 '16 at 13:32

Mikhail Korobov


Yes, my task is domain-specific. Thank you, I'll try those tools – Khrystyna Kosenko Dec 05 '16 at 13:47

score 1 · Answer 2 · edited May 23 '17 at 12:09

If you want to create your own training data with entity types other than just Location or Person, you can refer to my answer to "Is it possible to train Stanford NER system to recognize more named entities types?"

edited May 23 '17 at 12:09

Community


answered Dec 13 '16 at 11:21

Rohan Amrute


score 1 · Answer 3 · answered Jul 28 '17 at 20:15

Brat is an excellent way to annotate your new dataset. After annotating it, you need to convert the standoff format that Brat outputs into the format that Stanford NER accepts.
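As a rough sketch of such a conversion, the code below reads the text-bound (`T`) lines of a Brat `.ann` file and assigns a BIO label to each token of the source text by character offset. The helper names (`parse_ann`, `to_bio`), the regex tokenizer, the sample text and the `LOC` label are all assumptions made for illustration. Discontinuous spans are skipped for simplicity, and the column layout your trainer expects should be checked against its own documentation.

```python
import re


def parse_ann(ann_text):
    """Collect (start, end, type) from text-bound 'T' lines of a Brat .ann file."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # relations, events, notes etc. are ignored here
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        info = fields[1]  # e.g. "LOC 0 9"
        if ";" in info:
            continue  # discontinuous span: skipped in this sketch
        etype, start, end = info.split()
        spans.append((int(start), int(end), etype))
    return spans


def to_bio(text, spans):
    """Tokenize with a simple regex and label tokens that fall inside a span."""
    rows = []
    for m in re.finditer(r"\w+|[^\w\s]", text):
        label = "O"
        for start, end, etype in spans:
            if m.start() >= start and m.end() <= end:
                # First token of the span gets B-, the rest get I-
                prefix = "B-" if m.start() == start else "I-"
                label = prefix + etype
                break
        rows.append((m.group(), label))
    return rows


# Hypothetical sample: contents of doc.txt and doc.ann
text = "Melbourne (Australia), New York"
ann = (
    "T1\tLOC 0 9\tMelbourne\n"
    "T2\tLOC 11 20\tAustralia\n"
    "T3\tLOC 23 31\tNew York\n"
)

rows = to_bio(text, parse_ann(ann))

# One token per line, tab-separated token and label
output = "\n".join(f"{token}\t{label}" for token, label in rows)
```

A real corpus would also need sentence splitting and a tokenizer that matches the one used at prediction time. The regex here is only a stand-in.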

answered Jul 28 '17 at 20:15

Abhimanyu

