What created `maxent_treebank_pos_tagger/english.pickle`?

Question

The nltk package's built-in part-of-speech tagger does not seem to be optimized for my use case (here, for instance). The source code here shows that it uses a saved, pre-trained classifier called `maxent_treebank_pos_tagger`.

What created maxent_treebank_pos_tagger/english.pickle? I'm guessing that there is a tagged corpus out there somewhere that was used to train this tagger, so I think I'm looking for (a) that tagged corpus and (b) the exact code that trains the tagger based on the tagged corpus.

In addition to lots of googling, I have so far tried inspecting the .pickle object directly for clues, starting like this:

from nltk.data import load
x = load("nltk_data/taggers/maxent_treebank_pos_tagger/english.pickle")
dir(x)
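One way to get more out of this kind of inspection than a raw `dir(x)` is to record which class and module the unpickled object belongs to, since that points at the code that defines it. The following is only a minimal sketch: `describe` is a hypothetical helper, not part of nltk, and an `OrderedDict` stands in for the loaded tagger so the snippet runs without any nltk data installed.

```python
from collections import OrderedDict


def describe(obj):
    """Summarize an object: its class name, defining module, and public attributes."""
    cls = type(obj)
    return {
        "class": cls.__name__,
        "module": cls.__module__,
        "attributes": sorted(name for name in dir(obj) if not name.startswith("_")),
    }


# Stand-in object (an assumption for illustration); with the tagger loaded as
# in the question, the call would be describe(x) instead.
summary = describe(OrderedDict())
print(summary["module"], summary["class"])
print(summary["attributes"])
```

The `module` entry is usually the most useful clue, because it names the package whose source code defines the pickled class.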

Not entirely sure, but I believe the corpus used is the [Penn Treebank](https://www.cis.upenn.edu/~treebank/) — Igor, Jul 13 '15 at 15:00
@Igor, the source code I linked to above seems to agree. Unfortunately, it seems that the Penn Treebank data may not be free to the public, which would mostly answer my question: https://catalog.ldc.upenn.edu/LDC99T42 — zkurtz, Jul 13 '15 at 17:50

score 6 · Answer 1 · edited May 23 '17 at 11:58

The NLTK source is https://github.com/nltk/nltk/blob/develop/nltk/tag/__init__.py#L83

NLTK's MaxEnt POS tagger originally comes from https://github.com/arne-cl/nltk-maxent-pos-tagger

Training Data: Wall Street Journal subset of the Penn Treebank corpus

Features: Ratnaparkhi (1996)

Algorithm: Maximum Entropy

Accuracy: see the question "What is the accuracy of nltk pos_tagger?"
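To make the "Features" entry more concrete, here is a simplified sketch of the kind of contextual features described in Ratnaparkhi (1996): the current word, its neighbours, the previously assigned tags, and spelling cues such as prefixes and suffixes of up to four characters. This is an illustration under assumptions, not the exact feature set behind `english.pickle`: the function name `pos_features`, the feature keys, and the padding symbols are made up here, and the paper's distinction between rare and frequent words is ignored.

```python
def pos_features(words, index, prev_tags):
    """Return a feature dict for words[index], given the tags assigned so far."""
    word = words[index]

    def word_at(offset):
        # Neighbouring word, or a padding symbol past the sentence boundary.
        j = index + offset
        return words[j] if 0 <= j < len(words) else "<PAD>"

    def tag_back(n):
        # The n-th most recent tag, or a start symbol if there is none yet.
        return prev_tags[-n] if len(prev_tags) >= n else "<START>"

    features = {
        "word": word,
        "prev_word": word_at(-1),
        "prev2_word": word_at(-2),
        "next_word": word_at(1),
        "next2_word": word_at(2),
        "prev_tag": tag_back(1),
        "prev2_tags": tag_back(2) + " " + tag_back(1),
        "has_digit": any(ch.isdigit() for ch in word),
        "has_upper": any(ch.isupper() for ch in word),
        "has_hyphen": "-" in word,
    }
    # Prefixes and suffixes of length 1 to 4 (spelling cues).
    for n in range(1, 5):
        if len(word) >= n:
            features["prefix%d" % n] = word[:n]
            features["suffix%d" % n] = word[-n:]
    return features


example = pos_features(["Stocks", "fell", "sharply"], 1, ["NNS"])
print(example)
```

A maximum entropy classifier is then trained on pairs of such feature dicts and gold tags; NLTK's `MaxentClassifier.train` accepts a list of `(featureset, label)` pairs in this form.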

answered Jul 13 '15 at 20:35

alvas

Your second link (https://github.com/arne-cl/nltk-maxent-pos-tagger) is the part that seems to directly address my question. How do you know that this is the same `nltk-maxent-pos-tagger` that shows up in the official `nltk` package? – zkurtz Jul 14 '15 at 14:01
Why not raise an issue on the nltk github as well? – b3000 Aug 02 '15 at 12:18