Chunking with Python-Treetaggerwrapper

Question

TreeTagger can do POS tagging as well as text chunking, that is, extracting verbal and nominal chunks, as in this German example:

$ echo 'Das ist ein Test.' | cmd/tagger-chunker-german
    reading parameters ...
    tagging ...
     finished.
<NC>
Das PDS die
</NC>
<VC>
ist VAFIN   sein
</VC>
<NC>
ein ART eine
Test    NN  Test
</NC>
.   $.  .

I'm trying to do this with treetaggerwrapper in Python (since it's faster than calling TreeTagger directly), but I can't work out how. The documentation refers to chunking as preprocessing, so I tried this:

tags = tagger.tag_text(u"Dieser Satz ist ein Satz.",prepronly=True)

But the output is just a list of the words with no information added. I'm starting to think that what the wrapper calls chunking is something different from what the tagger itself calls chunking, but maybe I'm just missing something. Any help would be appreciated.

score 2 · Accepted Answer · answered Sep 23 '16 at 14:19

The original poster's assumption is right. treetaggerwrapper (as of version 2.2.4) defines chunking as merely "preprocessing of text" and does not wrap TreeTagger's own chunking capability. From treetaggerwrapper.py:

Manage preprocessing of text (chunking) in place of external Perl scripts as in base TreeTagger installation, thus avoid starting Perl each time a piece of text must be tagged.

But inspecting tagger-chunker-german shows that getting chunks and tags takes a chain of operations that calls TreeTagger three times:

$ echo 'Das ist ein Test.' |
    cmd/tree-tagger-german |
    perl -nae 'if ($#F==0){print} else {print "$F[0]-$F[1]\n"}' |
    bin/tree-tagger lib/german-chunker.par -token -sgml -eps 0.00000001 -hyphen-heuristics -quiet |
    cmd/filter-chunker-output-german.perl |
    bin/tree-tagger -quiet -token -lemma -sgml lib/german-utf8.par

whereas treetaggerwrapper's tagging command (shown in tagcmdlist) is a single call (after its own preprocessing of the text) to:

bin/tree-tagger -token -lemma -sgml -quiet -no-unknown lib/german-utf8.par

The point of entry to extend it for chunking is the line

"tagparfile": "german-utf8.par",

where you would define something like

"chunkingparfile": "german-chunker.par",

and issue an additional call to TreeTagger with this other parfile, following the tagger-chunker-german chain of operations. You would probably still have to port some extra logic from cmd/filter-chunker-output-german.perl, though.
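A minimal sketch of the middle of that chain in Python, assuming a standard TreeTagger install layout. The directory TREETAGGER_DIR and the helpers join_word_and_tag, chunker_command and run_chunker are hypothetical names, not part of treetaggerwrapper. It covers only the perl one-liner and the chunker call; the filter script and the final lemma pass are left out.

```python
import os
import subprocess

# Hypothetical install location; adjust to your own TreeTagger directory.
TREETAGGER_DIR = os.path.expanduser("~/treetagger")


def join_word_and_tag(tagged_lines):
    """Mimic the perl one-liner: 'word<TAB>POS<TAB>lemma' becomes 'word-POS'.

    Lines with fewer than two fields (e.g. SGML tags) pass through unchanged.
    """
    joined = []
    for line in tagged_lines:
        fields = line.split()
        if len(fields) < 2:
            joined.append(line.strip())
        else:
            joined.append(fields[0] + "-" + fields[1])
    return joined


def chunker_command(treetagger_dir=TREETAGGER_DIR):
    """Build the chunker call with the flags copied from tagger-chunker-german."""
    return [
        os.path.join(treetagger_dir, "bin", "tree-tagger"),
        os.path.join(treetagger_dir, "lib", "german-chunker.par"),
        "-token", "-sgml", "-eps", "0.00000001",
        "-hyphen-heuristics", "-quiet",
    ]


def run_chunker(tagged_lines, treetagger_dir=TREETAGGER_DIR):
    """Feed 'word-POS' lines to the chunker and return its raw output lines."""
    payload = "\n".join(join_word_and_tag(tagged_lines)) + "\n"
    result = subprocess.run(
        chunker_command(treetagger_dir),
        input=payload, capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout.splitlines()
```

In principle the list returned by tagger.tag_text() could be passed to run_chunker(), but that is untested here, and the result would still need the logic from cmd/filter-chunker-output-german.perl before it resembles the output in the question.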

score 1 · Answer 2 · answered Jun 21 '16 at 11:17

A full code example would make this easier to answer (please provide one in future questions), but I'll give it a try. The treetaggerwrapper documentation has a nice example:

>>> import pprint   # For proper print of sequences.
>>> import treetaggerwrapper
>>> #1) build a TreeTagger wrapper:
>>> tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
>>> #2) tag your text.
>>> tags = tagger.tag_text("This is a very short text to tag.")
>>> #3) use the tags list... (list of string output from TreeTagger).
>>> pprint.pprint(tags)
['This\tDT\tthis',
 'is\tVBZ\tbe',
 'a\tDT\ta',
 'very\tRB\tvery',
 'short\tJJ\tshort',
 'text\tNN\ttext',
 'to\tTO\tto',
 'tag\tVV\ttag',
 '.\tSENT\t.']
>>> # Note: in output strings, fields are separated with tab chars (\t).
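Since each output string is just three tab-separated fields, it can be split into a small record. The sketch below uses only the standard library; Token and parse_tagged_lines are illustrative names, and treetaggerwrapper's own make_tags() helper covers similar ground.

```python
from collections import namedtuple

# Illustrative record type for one line of tagger output.
Token = namedtuple("Token", ["word", "pos", "lemma"])


def parse_tagged_lines(lines):
    """Split 'word<TAB>POS<TAB>lemma' strings into Token records.

    Lines that do not have exactly three fields are skipped.
    """
    tokens = []
    for line in lines:
        fields = line.split("\t")
        if len(fields) == 3:
            tokens.append(Token(*fields))
    return tokens


sample = ['This\tDT\tthis', 'is\tVBZ\tbe', '.\tSENT\t.']
tokens = parse_tagged_lines(sample)
print([t.pos for t in tokens])  # ['DT', 'VBZ', 'SENT']
```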

Note that this example is for Python 3: the text has no u prefix to mark it as Unicode, because Python 3 strings are Unicode by default, while Python 2.7 needs the prefix, as in your post. Which raises the question of which Python version you are using.

Chunking

Chunking is tagging of multi-token sequences, e.g. The yellow dog:

Word -> POS-Tag
The -> DT (determiner)
yellow -> JJ (adjective)
dog -> NN (noun)

All three words together are a chunk and will be tagged as NP (noun phrase).
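To make the idea concrete, here is a rough sketch that groups chunker output like the German sample in the question into (label, words) pairs. The function name group_chunks is made up for illustration, and the sketch assumes flat, non-nested chunks, which may not hold for every sentence.

```python
def group_chunks(lines):
    """Group chunker output lines into (label, [words]) pairs.

    Tokens outside any chunk are returned with the label None.
    Assumes chunks are not nested.
    """
    chunks = []
    current_label = None
    current_words = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("</") and line.endswith(">"):
            # Closing tag: emit the chunk collected so far.
            chunks.append((current_label, current_words))
            current_label, current_words = None, []
        elif line.startswith("<") and line.endswith(">"):
            # Opening tag: start a new chunk with this label.
            current_label = line[1:-1]
            current_words = []
        else:
            word = line.split()[0]
            if current_label is None:
                chunks.append((None, [word]))
            else:
                current_words.append(word)
    return chunks


sample = [
    "<NC>", "Das\tPDS\tdie", "</NC>",
    "<VC>", "ist\tVAFIN\tsein", "</VC>",
    "<NC>", "ein\tART\teine", "Test\tNN\tTest", "</NC>",
    ".\t$.\t.",
]
print(group_chunks(sample))
# [('NC', ['Das']), ('VC', ['ist']), ('NC', ['ein', 'Test']), (None, ['.'])]
```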

score 0 · Answer 3 · answered May 04 '17 at 10:28

I think treetaggerwrapper uses the TreeTagger binary to do the tagging, since I found this in treetaggerwrapper.py:

    # ----- Set binary by platform.
    if ON_WINDOWS:
        self.tagbin = os.path.join(self.tagbindir, "tree-tagger.exe")
    elif ON_MACOSX or ON_POSIX:
        self.tagbin = os.path.join(self.tagbindir, "tree-tagger")

So the answer seems fairly obvious: TreeTagger itself doesn't provide a binary for the chunker, which is why neither treetaggerwrapper nor the other library, "treetagger-python", has a chunking function.
