I'm trying to use CRFSuite but I can't figure out how to use the example/ner.py and pos.py

Precisely, how do I make an input of the form:

# Ner.py
fields = 'y w pos chk'

or

# Pos.py
fields = 'w num cap sym p1 p2 p3 p4 s1 s2 s3 s4 y'

The "y w pos" I can get from a CoNLL data set, for example, but I don't really understand the "chk" part or all those fields in pos.py.

Also, is there a way to process a raw text (without all those tags) with CRFSuite given that I have a trained model?

Franck Dernoncourt
user1079319
  • I, too, am interested in solving this issue. In particular, starting from the cited CoNLL data (2000 for chunking, 2003 for NER, but what to use for PoS?), how do I generate the PoS data? As cited, the input has to be `'w num cap sym p1 p2 p3 p4 s1 s2 s3 s4 y'`, meaning the word itself first and the PoS tag last. But what is all the stuff in between, and how do I generate it? – fnl Sep 25 '14 at 09:29
  • Maybe the question regarding the PoS part more precisely should be: How does one generate the PoS tagging input format from a regular, PoS tagged OWPL file (`"word tag\n"...`) using these scripts? – fnl Sep 25 '14 at 09:39

3 Answers

@michele is right. This task requires another dataset. I believe the datasets are here: http://www.cnts.ua.ac.be/conll2003/ner/

Legend

You cannot use ner.py or pos.py with the data provided by the author of the tutorial. You need a proper CoNLL-2000 data set. :)

Just as an example, you can find it here

I hope this answers your question.

Matt Fenwick
user_1177868
  • Yes, sort of, but how to generate the CRFsuite input data for PoS tagging? I.e., where is the original data set that is used to generate training/test files using the PoS template with the fields as shown in the question? (The ner.py file shows what each field actually means (num, cap, sym, p1-4, and s1-4).) – fnl Sep 25 '14 at 09:32
  • To get the CoNLL-2000 data set (English): `import nltk; train_sents = list(nltk.corpus.conll2000.iob_sents('train.txt')); test_sents = list(nltk.corpus.conll2000.iob_sents('test.txt'))` – Franck Dernoncourt Sep 13 '15 at 03:44

It turned out to be simpler to slightly modify the pos.py file so that it does what it should be doing. The input format for pos.py is now 'w y', and the features 'num cap sym p1 p2 p3 p4 s1 s2 s3 s4' are all generated by the script itself. This should solve the pos.py issues. Here is the gist:

https://gist.github.com/fnl/21116fa57527946c5dbe
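
As a rough illustration of the idea (this is not the gist's actual code), the intermediate columns can be derived from the word alone. The sketch assumes that num, cap and sym flag a digit, an initial capital and a punctuation character, and that p1-p4 and s1-s4 are the prefixes and suffixes of length 1 to 4. The helper names `word_features` and `to_row`, the True/False encoding and the '-' placeholder for short words are made up for this example.

```python
import string

# Column order taken from the `fields` line in pos.py.
FIELDS = 'w num cap sym p1 p2 p3 p4 s1 s2 s3 s4 y'.split()


def word_features(word):
    """Derive the intermediate columns from the surface form alone.

    The meaning of each column is an assumption (see the text above).
    """
    feats = {
        'w': word,
        'num': str(any(c.isdigit() for c in word)),              # contains a digit
        'cap': str(word[:1].isupper()),                          # starts with a capital
        'sym': str(any(c in string.punctuation for c in word)),  # contains punctuation
    }
    for n in range(1, 5):
        # Prefix and suffix of length n; '-' if the word is shorter than n
        # (the placeholder is hypothetical, chosen for this sketch).
        feats['p%d' % n] = word[:n] if len(word) >= n else '-'
        feats['s%d' % n] = word[-n:] if len(word) >= n else '-'
    return feats


def to_row(word, tag):
    """Turn one 'word tag' pair into a space-separated row in FIELDS order."""
    feats = word_features(word)
    feats['y'] = tag
    return ' '.join(feats[f] for f in FIELDS)


if __name__ == '__main__':
    for word, tag in [('Running', 'VBG'), ('3rd', 'JJ')]:
        print(to_row(word, tag))
```

Feeding each "word tag" line of a one-word-per-line file through `to_row`, and keeping the blank lines between sentences, would give rows in the column order the original pos.py expects.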

As for the ner.py script, as answered by @Legend already, the relevant input data format can be found, for example, here:

http://www.cnts.ua.ac.be/conll2003/ner/
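
The CoNLL-2003 files list one token per line as "word pos chunk ner", whereas ner.py declares `fields = 'y w pos chk'`. Assuming that "chk" is the chunk tag and "y" the NER label, the columns only need to be reordered. The function name `conll2003_to_ner_fields` is hypothetical, and collapsing the `-DOCSTART-` markers into sentence breaks is a choice made for this sketch:

```python
def conll2003_to_ner_fields(lines):
    """Reorder 'word pos chunk ner' rows into 'ner word pos chunk' rows.

    Blank lines (sentence breaks) are kept, but never doubled;
    -DOCSTART- marker lines are treated as sentence breaks.
    """
    out = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('-DOCSTART-'):
            # Emit at most one blank line between sentences.
            if out and out[-1] != '':
                out.append('')
            continue
        word, pos, chk, ner = line.split()
        out.append(' '.join((ner, word, pos, chk)))
    return out


if __name__ == '__main__':
    sample = [
        '-DOCSTART- -X- -X- O',
        '',
        'EU NNP I-NP I-ORG',
        'rejects VBZ I-VP O',
        '',
    ]
    print('\n'.join(conll2003_to_ner_fields(sample)))
```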

fnl