I don't think there's an explicit French model for word_tokenize (which is the modified Treebank tokenizer used for the English Penn Treebank). The word_tokenize function performs sentence tokenization using the sent_tokenize function before the actual word tokenization. The language argument in word_tokenize is only used for the sent_tokenize part.
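You can see this by calling word_tokenize with language='french': the sentence splitting uses the French punkt model (fetched via nltk.download('punkt')), but the word-level step is still the English Treebank tokenizer, so the clitics stay attached. A quick check (Python 3 repr shown; I'd expect the same tokens on your version, though the exact repr may differ):

>>> from nltk import word_tokenize
>>> sent = u"Le télétravail n'aura pas d'effet sur ma vie"
>>> word_tokenize(sent, language='french')
['Le', 'télétravail', "n'aura", 'pas', "d'effet", 'sur', 'ma', 'vie']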
Alternatively, you can use the MosesTokenizer, which has some language-dependent regexes (and it does support French):
>>> from nltk.tokenize.moses import MosesTokenizer
>>> moses = MosesTokenizer(lang='fr')
>>> sent = u"Le télétravail n'aura pas d'effet sur ma vie"
>>> moses.tokenize(sent)
[u'Le', u't\xe9l\xe9travail', u'n&apos;', u'aura', u'pas', u'd&apos;', u'effet', u'sur', u'ma', u'vie']
If you don't want Moses to escape special XML characters, you can do:
>>> moses.tokenize(sent, escape=False)
[u'Le', u't\xe9l\xe9travail', u"n'", u'aura', u'pas', u"d'", u'effet', u'sur', u'ma', u'vie']
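Note that the nltk.tokenize.moses module was later removed from NLTK (around version 3.3, because of licensing issues) and the tokenizer now lives in the separate sacremoses package. A minimal sketch of the equivalent call, assuming sacremoses is pip-installed:

>>> from sacremoses import MosesTokenizer
>>> moses = MosesTokenizer(lang='fr')
>>> moses.tokenize(u"Le télétravail n'aura pas d'effet sur ma vie", escape=False)
['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie']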
As for why splitting n' and d' is useful in French NLP: linguistically, separating them does make sense because they're clitics that have their own syntactic and semantic properties but are bound to a root/host.
In French, ne ... pas is a discontinuous constituent that denotes negation; ne becomes the clitic n' because of the vowel onset in the word that follows it, so splitting n' from aura makes it easier to identify the ne ... pas pattern.
In the case of d', the same phonetic motivation applies: the vowel onset of the following word turns de effet into d'effet.
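As a toy illustration of that last point about negation (my own sketch, not from any library): once the clitic is split off, spotting the discontinuous ne ... pas is a simple scan over the token list, whereas with n'aura kept whole you'd need extra string surgery:

>>> def has_ne_pas(tokens):
...     # toy check for the discontinuous "ne ... pas" negation
...     for i, tok in enumerate(tokens):
...         if tok.lower() in ("ne", "n'") and 'pas' in [t.lower() for t in tokens[i+1:]]:
...             return True
...     return False
...
>>> has_ne_pas(moses.tokenize(sent, escape=False))      # clitic split off
True
>>> has_ne_pas(word_tokenize(sent, language='french'))  # "n'aura" kept whole
False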