
I'm trying to use NLTK's word_tokenize on a French text:

txt = ["Le télétravail n'aura pas d'effet sur ma vie"]
print(word_tokenize(txt,language='french'))

It should print:

['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie', '.']

But I get:

['Le', 'télétravail', "n'aura", 'pas', "d'effet", 'sur', 'ma', 'vie', '.']

Does anyone know why it's not splitting tokens properly in French, and how to overcome this (and other potential issues) when doing NLP in French?

JB5778

3 Answers


Looking at the source of word_tokenize reveals that the language argument is only used to determine how to split the input into sentences. For tokenization at the word level, a (slightly modified) TreebankWordTokenizer is used, which works best for English input and contractions like don't. From nltk/tokenize/__init__.py:

_treebank_word_tokenizer = TreebankWordTokenizer()
# ... some modifications done
def word_tokenize(text, language='english', preserve_line=False):
    # ...
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]
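
A quick way to see this for yourself (a small sketch, assuming the Punkt sentence models are installed via nltk.download('punkt')): the language argument changes how sentences are split, but the word-level pass stays the same:

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Le télétravail n'aura pas d'effet sur ma vie. C'est dommage."

# 'french' selects the French Punkt model, used only for sentence splitting
print(sent_tokenize(text, language='french'))

# the word-level step is still the (English-oriented) Treebank tokenizer,
# so the clitics n' and d' stay attached to their hosts
print(word_tokenize(text, language='french'))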

To get your desired output, you might want to consider using a different tokenizer, such as a RegexpTokenizer, as follows:

txt = "Le télétravail n'aura pas d'effet sur ma vie"
pattern = r"[dnl]['´`]|\w+|\$[\d\.]+|\S+"
tokenizer = RegexpTokenizer(pattern)
tokenizer.tokenize(txt)
# ['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie']

My knowledge of French is limited and this only solves the stated problem. For other cases you will have to adapt the pattern. You can also look at the implementation of the TreebankWordTokenizer for ideas for a more complex solution. Also keep in mind that this way you will need to split sentences beforehand, if necessary; a minimal example of that follows.
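
For example, combining French sentence splitting with this tokenizer (a sketch, assuming the French Punkt model is available) could look like this:

from nltk.tokenize import RegexpTokenizer, sent_tokenize

tokenizer = RegexpTokenizer(r"[dnl]['´`]|\w+|\$[\d\.]+|\S+")
text = "Le télétravail n'aura pas d'effet sur ma vie. Il n'y a pas de doute."

# split into sentences with the French model, then tokenize each sentence
tokens = [token for sent in sent_tokenize(text, language='french')
          for token in tokenizer.tokenize(sent)]
print(tokens)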

vkoe
  • Thank you so much for your help and explanations! appreciated! – JB5778 Nov 19 '17 at 17:07
  • I tried and I don't get your solution. With the pattern you wrote, I get the following: ['Le', 'télétravail', 'n', '’aura', 'pas', 'd', '’effet', 'sur', 'ma', 'charge', 'de', 'travail', '.'] The apostrophe is attached to the second word instead of the first (for "n'aura" and "d'effet"). – JB5778 Nov 19 '17 at 17:46
  • I'm using nltk version 3.2.5. Did you copy the pattern from my posted snippet? The order of the groups in the regex is actually important, the `[dnl]'` should be the first part of the pattern. What I wrote is only a quick example and surely can be improved. – vkoe Nov 19 '17 at 19:15
  • Hello, yes I copied/pasted your pattern. Will try to learn more about pattern handling in Python (am not very good at it). I'm using the most updated version of nltk. Thanks again a lot for showing me the way. – JB5778 Nov 19 '17 at 21:09
  • Are you using python2? I did this in python3 and this has an impact on how strings are treated. I also edited the snippet in my answer to account for different kinds of apostrophes. Probably the easiest solution is to go with the `MosesTokenizer` suggested by alvas. I wasn't aware of its implementation in nltk. – vkoe Nov 20 '17 at 07:16
  • Thanks VaID, it worked nicely now (using Python3) :-) Probably indeed a question of the different types of apostrophes. Will have a look at MosesTokenizer. – JB5778 Nov 20 '17 at 07:57

I don't think there's an explicit French model for word_tokenize (which is the modified Treebank tokenizer used for the English Penn Treebank).

The word_tokenize function performs sentence tokenization using the sent_tokenize function before the actual word tokenization. The language argument in word_tokenize is only used for the sent_tokenize part.

Alternatively, you can use the MosesTokenizer, which has some language-dependent regexes (and it does support French):

>>> from nltk.tokenize.moses import MosesTokenizer
>>> moses = MosesTokenizer(lang='fr')
>>> sent = u"Le télétravail n'aura pas d'effet sur ma vie"
>>> moses.tokenize(sent)
[u'Le', u't\xe9l\xe9travail', u'n&apos;', u'aura', u'pas', u'd&apos;', u'effet', u'sur', u'ma', u'vie']

If you don't like that Moses escapes special XML characters, you can do:

>>> moses.tokenize(sent, escape=False)
[u'Le', u't\xe9l\xe9travail', u"n'", u'aura', u'pas', u"d'", u'effet', u'sur', u'ma', u'vie']
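
Note that MosesTokenizer has since been removed from NLTK (see the last comment below). The same tokenizer lives on in the standalone sacremoses package, so a roughly equivalent call today would be (a sketch, assuming pip install sacremoses):

from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang='fr')
print(mt.tokenize("Le télétravail n'aura pas d'effet sur ma vie", escape=False))
# expected: ['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie']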

To explain why splitting n' and d' is useful in French NLP:

Linguistically, separating n' and d' makes sense because they are clitics that have their own syntactic and semantic properties but are bound to their root/host.

In French, ne ... pas is a discontinuous constituent denoting negation. The clitic ne surfaces as n' because of the vowel onset in the word that follows, so splitting n' from aura makes it easier to identify the ne ... pas pattern.

In the case of d', it's the same phonetic motivation: the vowel onset in the following word turns de effet into d'effet.
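
As a toy illustration of why the split helps downstream (a hypothetical sketch, not an NLTK API): once n' is its own token, the discontinuous ne ... pas pattern can be found with a simple scan over the token list:

tokens = ['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie']

def has_ne_pas(tokens, window=3):
    # look for ne / n' with pas within a few tokens after it
    for i, token in enumerate(tokens):
        if token.lower() in ('ne', "n'"):
            if 'pas' in (t.lower() for t in tokens[i + 1:i + 1 + window]):
                return True
    return False

print(has_ne_pas(tokens))  # True; with the unsplit "n'aura" this check would miss it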

alvas
  • Yes, thank you for your reply. Linguistically, "n' ... pas" equals "ne pas" in French (a discontinuous constituent). In the case of the "d'" in "d'effet", "d'" should be considered a stopword (as it does not bring any interesting information about the meaning of the sentence). However, after checking with print(stopwords.words('french')), I realized "d'" is not tagged as a stopword in nltk.corpus ("de" is). That's a problem. Do you know by any chance how I could add stopwords to the current French stopwords list in nltk.corpus? – JB5778 Nov 20 '17 at 11:03
  • Yes, adding the clitic form should be possible. Please create an issue at https://github.com/nltk/nltk_data as a feature request. (In the meantime, a quick local workaround is sketched just after this thread.) – alvas Nov 20 '17 at 11:06
  • Do you know by any chance if (n' ... pas) is considered a negation group in the current implementation (equivalent to "ne ... pas")? – JB5778 Nov 20 '17 at 11:06
  • No it isn't. Discontinuous constituents are infamously hard to catch. – alvas Nov 20 '17 at 11:08
  • Ok, I realized after further investigation that "d" is a stopword in nltk.corpus (French) but "d'" is not. That is odd... I would have expected "d'" with the apostrophe to be a stopword, and not just the letter "d"... Does that mean that I should tokenize my sentence this way: ['Le', 'télétravail', 'n', 'aura', 'pas', 'd', 'effet', 'sur', 'ma', 'vie', "'", '.'], with the apostrophe separate? – JB5778 Nov 20 '17 at 11:31
  • MosesTokenizer does the work. The only issue is that escape=False does not seem to make a difference for special XML characters (I opened an issue about it on GitHub): I get the same result for moses.tokenize(sent, escape=False) and moses.tokenize(sent, escape=True) ==> ['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie'] (Python 3). Do you have an idea why the escape flag is not working? – JB5778 Nov 20 '17 at 14:59
  • 1
    According to https://www.nltk.org/news.html?highlight=moses , Moses tokenizer has been removed... – Gabriel Romon Jun 27 '18 at 14:31
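
A minimal sketch of the local stop-word workaround discussed in the thread above (the only assumption is that stopwords.words() returns a plain Python list, which it does; exactly which elided forms to add is a judgment call):

from nltk.corpus import stopwords

# NLTK's French list plus the elided clitic forms it lacks (hypothetical choice)
french_stops = set(stopwords.words('french')) | {"n'", "d'", "l'", "qu'"}

tokens = ['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie']
print([t for t in tokens if t.lower() not in french_stops])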

Here we see that the handling of French elision is not satisfactory. So, I recommend correcting the problem by post-processing the output of NLTK's word_tokenize:

import re

from nltk.tokenize import word_tokenize

# matches an elided form (letters ending in an apostrophe) followed by its host word
compiled_pattern = re.compile(r"([a-zA-ZÀ-Ÿ]+['’])([a-zA-ZÀ-Ÿ]*)")

def split_in_words_fr(text):
    tokens = word_tokenize(text)
    new_tokens = []
    for token in tokens:
        search_results = re.findall(r"['’]", token)
        if search_results and len(search_results) == 1:
            # split e.g. "n'aura" into ["n'", 'aura'] via the two capture groups
            new_tokens.extend(re.split(compiled_pattern, token)[1:3])
        else:
            new_tokens.append(token)
    return new_tokens

Then:

print(split_in_words_fr("Le télétravail n'aura pas d'effet sur ma vie"))

gives:

['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie']
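
Since the pattern accepts any letters before the apostrophe, the same post-processing also covers other elided forms (l', qu', c', ...), for instance:

print(split_in_words_fr("C'est l'heure qu'il préfère."))
# expected: ["C'", 'est', "l'", 'heure', "qu'", 'il', 'préfère', '.']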

A less satisfactory solution is to use wordpunct_tokenize, which splits on every non-alphanumeric character.

from nltk.tokenize import wordpunct_tokenize

print(wordpunct_tokenize("Le télétravail n'aura pas d'effet sur ma vie"))

which gives

['Le', 'télétravail', 'n', "'", 'aura', 'pas', 'd', "'", 'effet', 'sur', 'ma', 'vie']
Claude COULOMBE