Running CRFSuite examples

Question

I'm trying to use CRFSuite, but I can't figure out how to use the example/ner.py and example/pos.py scripts.

Precisely, how do I make an input of the form:

# ner.py
fields = 'y w pos chk'

or

# pos.py
fields = 'w num cap sym p1 p2 p3 p4 s1 s2 s3 s4 y'

The "y w pos" fields I can get from a CoNLL data set, for example, but I don't really understand the "chk" part or all those fields in pos.py.

Also, is there a way to process raw text (without all those tags) with CRFSuite, given that I have a trained model?

I, too, am interested in solving this issue. Particularly, starting from the cited CoNLL data (2000 for chunking, 2003 for NER, but what to use for PoS?), how do I generate the PoS data? As cited, the input has to be `'w num cap sym p1 p2 p3 p4 s1 s2 s3 s4 y'`, meaning the word itself first and the PoS tag last. But what is all the stuff in between, and how do I generate it? – fnl, Sep 25 '14 at 09:29
Maybe the question regarding the PoS part more precisely should be: How does one generate the PoS tagging input format from a regular, PoS tagged OWPL file (`"word tag\n"...`) using these scripts? — fnl, Sep 25 '14 at 09:39

score 2 · Answer 1 · answered Jul 17 '13 at 22:11

@michele is right. This task requires another dataset. I believe the datasets are here: http://www.cnts.ua.ac.be/conll2003/ner/

answered Jul 17 '13 at 22:11

Legend


score 1 · Answer 2 · edited Feb 21 '12 at 15:14

You cannot use ner.py or pos.py with the data provided by the author of the tutorial. You need a proper CoNLL-2000 data set. :)

Just as an example, you can find it here

I hope I have replied correctly to your question.

edited Feb 21 '12 at 15:14

Matt Fenwick


answered Feb 21 '12 at 15:09

user_1177868


Yes, sort of, but how to generate the CRFsuite input data for PoS tagging? I.e., where is the original data set that is used to generate training/test files using the PoS template with the fields as shown in the question? (The ner.py file shows what each field actually means (num, cap, sym, p1-4, and s1-4).) – fnl Sep 25 '14 at 09:32
To get the CoNLL-2000 data set (English): `import nltk; train_sents = list(nltk.corpus.conll2000.iob_sents('train.txt')); test_sents = list(nltk.corpus.conll2000.iob_sents('test.txt'))` – Franck Dernoncourt Sep 13 '15 at 03:44

score 0 · Answer 3 · answered Sep 25 '14 at 10:07

It turned out to be simpler to slightly modify the pos.py file so that it does what it should be doing. The input format for the modified pos.py is now 'w y', while the features 'num cap sym p1 p2 p3 p4 s1 s2 s3 s4' are all generated by the script itself. This should solve the pos.py issues. Here is the gist:

https://gist.github.com/fnl/21116fa57527946c5dbe
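
To give a rough idea of what "generated by the script itself" can mean, here is a minimal sketch. It is not the code from the gist. The meanings of the fields are my assumptions: `num` = the word contains a digit, `cap` = the word starts with an uppercase letter, `sym` = the word contains a non-alphanumeric character, `p1`..`p4` = prefixes and `s1`..`s4` = suffixes of length 1 to 4. The helper names `word_features` and `sentence_to_items`, and the simple `name=value` tab-separated output, are hypothetical too; a real CRFsuite data file also needs any colon inside an attribute escaped, which this sketch does not do.

```python
def word_features(word):
    """Derive the intermediate fields from one token (assumed meanings)."""
    feats = {
        "w": word,
        "num": any(ch.isdigit() for ch in word),      # assumption: has a digit
        "cap": word[:1].isupper(),                    # assumption: initial capital
        "sym": any(not ch.isalnum() for ch in word),  # assumption: has a symbol
    }
    for n in range(1, 5):
        # Only emit a prefix/suffix if the word is long enough for it.
        if len(word) >= n:
            feats[f"p{n}"] = word[:n]
            feats[f"s{n}"] = word[-n:]
    return feats


def sentence_to_items(tokens):
    """Turn [(word, tag), ...] into lines of 'tag<TAB>name=value<TAB>...'."""
    lines = []
    for word, tag in tokens:
        attrs = [f"{name}={value}" for name, value in word_features(word).items()]
        lines.append("\t".join([tag] + attrs))
    return lines


if __name__ == "__main__":
    for line in sentence_to_items([("Confidence", "NN"), ("in", "IN")]):
        print(line)
```

With something like this, a plain "word tag" file is enough as input, because every other column is computed from the word.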

As for the ner.py script, as answered by @Legend already, the relevant input data format can be found, for example, here:

http://www.cnts.ua.ac.be/conll2003/ner/
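
If the CoNLL-2003 files are in the usual "word pos chunk ner-tag" column order, getting to the `'y w pos chk'` order from the question is just a matter of moving the last column to the front. A small sketch, assuming space-separated columns, blank lines between sentences, and `-DOCSTART-` marker lines that can be dropped (the function name `conll2003_to_ner_fields` is made up for this example):

```python
def conll2003_to_ner_fields(lines):
    """Reorder 'w pos chk y' lines into 'y w pos chk' lines."""
    out = []
    for line in lines:
        line = line.strip()
        if line.startswith("-DOCSTART-"):
            continue  # document marker, not a token
        if not line:
            out.append("")  # keep sentence boundaries
            continue
        word, pos, chk, label = line.split()
        out.append(" ".join([label, word, pos, chk]))
    return out


if __name__ == "__main__":
    sample = ["EU NNP I-NP I-ORG", "rejects VBZ I-VP O", ""]
    print("\n".join(conll2003_to_ner_fields(sample)))
```

Whether ner.py wants exactly this separator is something to check against the script's options.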
