7

I'm using NLTK and TextBlob to find nouns and noun phrases in a text:

from textblob import TextBlob 
import nltk

blob = TextBlob(text)
print(blob.noun_phrases)
tokenized = nltk.word_tokenize(text)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if pos.startswith('NN')]  # Penn Treebank noun tags start with 'NN'
print(nouns)

This works fine if my text is in English, but it no longer works if my text is in French.

I was unable to find how to adapt this code for the French language. How do I do that?

And is there a list somewhere of all the languages that can be parsed?

Sulli
  • 763
  • 1
  • 11
  • 33
  • You have two separate code snippets. One uses `TextBlob` (lines 1 and 2). The other uses `nltk` (lines 3-5). Which one does not work? – DYZ Feb 05 '17 at 23:19
  • @DYZ Both work with an English text, but neither works with a French text. With a French text, TextBlob reports noun phrases that are not really phrases, and nltk reports words that are not nouns – Sulli Feb 06 '17 at 16:28

2 Answers

5

Extract words from french sentence with NLTK

Under WSL2 Ubuntu with Python3, I can download Punkt like this:

import nltk
nltk.download('punkt')

The zip archive has been downloaded under:

/home/my_username/nltk_data/tokenizers/punkt.zip

Once it has been unzipped, you'll find models for many languages, each stored as a pickled (serialized) object.
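To answer the question about which languages are available, one option is simply to list the `.pickle` files in that folder. A minimal sketch (the helper name is mine, and the path is wherever the archive was unzipped on your machine):

```python
import os

def punkt_languages(punkt_dir):
    # Each Punkt model is stored as <language>.pickle; list the languages.
    # punkt_dir is e.g. /home/my_username/nltk_data/tokenizers/punkt
    return sorted(f[:-len('.pickle')] for f in os.listdir(punkt_dir)
                  if f.endswith('.pickle'))
```

Calling `punkt_languages('/home/my_username/nltk_data/tokenizers/punkt')` returns names like `czech`, `english`, `french`, ... that you can plug into `nltk.data.load`.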

Now with:

tokenizer = nltk.data.load('path/to/punkt_folder/french.pickle')

You can use the tokenizer's `_tokenize_words` method (note the leading underscore: this is a private API):

words_generator = tokenizer._tokenize_words("Depuis huit jours, j'avais déchiré mes bottines Aux cailloux des chemins. J'entrais à Charleroi. - Au Cabaret-Vert : je demandai des tartines De beurre et du jambon qui fût à moitié froid.")
words = [word for word in words_generator]

words is a list of PunktToken objects:

>>> words
[PunktToken('Depuis', type='depuis', linestart=True), PunktToken('huit', ), PunktToken('jours', ),... PunktToken('à', ), PunktToken('moitié', ), PunktToken('froid.', )]
>>> str_words = [str(w) for w in words]
>>> str_words
['Depuis', 'huit', 'jours', ',', 'j', "'avais", 'déchiré', 'mes', 'bottines', 'Aux', 'cailloux', 'des', 'chemins.', 'J', "'entrais", 'à', 'Charleroi.', '-', 'Au', 'Cabaret-Vert', ':', 'je', 'demandai', 'des', 'tartines', 'De', 'beurre', 'et', 'du', 'jambon', 'qui', 'fût', 'à', 'moitié', 'froid.']

Use nltk.pos_tag with French sentences

The OP wants to use nltk.pos_tag. That is not possible with the method described previously.

One way to go seems to be installing the Stanford Tagger, which is written in Java (found in this other SO question).

Download the latest version of the Stanford Tagger (available here):

> wget https://nlp.stanford.edu/software/stanford-tagger-4.2.0.zip

Once unzipped, you've got a folder which looks like this (the OP asked for the list of available languages):

...
├── data
│   ....
├── models
    ...
│   ├── arabic-train.tagger
│   ├── arabic-train.tagger.props
│   ├── arabic.tagger
│   ├── arabic.tagger.props
│   ├── chinese-distsim.tagger
│   ├── chinese-distsim.tagger.props
│   ├── chinese-nodistsim.tagger
│   ├── chinese-nodistsim.tagger.props
│   ├── english-bidirectional-distsim.tagger
│   ├── english-bidirectional-distsim.tagger.props
│   ├── english-caseless-left3words-distsim.tagger
│   ├── english-caseless-left3words-distsim.tagger.props
│   ├── english-left3words-distsim.tagger
│   ├── english-left3words-distsim.tagger.props
│   ├── french-ud.tagger
│   ├── french-ud.tagger.props
│   ├── german-ud.tagger
│   ├── german-ud.tagger.props
│   ├── spanish-ud.tagger
│   └── spanish-ud.tagger.props
    ...
├── stanford-postagger-4.2.0.jar
...

Java must be installed, and you must know where it is. Now you can do:

import os

from nltk.tag import StanfordPOSTagger
from textblob import TextBlob

jar = 'path/to/stanford-postagger-full-2020-11-17/stanford-postagger.jar'
model = 'path/to/stanford-postagger-full-2020-11-17/models/french-ud.tagger'
os.environ['JAVAHOME'] = '/path/to/java'

blob = TextBlob("""
    Depuis huit jours, j'avais déchiré mes bottines Aux cailloux des chemins. J'entrais à Charleroi. - Au Cabaret-Vert : je demandai des tartines De beurre et du jambon qui fût à moitié froid.
""")

pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')
res = pos_tagger.tag(blob.split())
print(res)

It will display:

[('Depuis', 'ADP'), ('huit', 'NUM'), ('jours,', 'NOUN'), ("j'avais", 'ADJ'), ('déchiré', 'VERB'), ('mes', 'DET'), ('bottines', 'NOUN'), ('Aux', 'PROPN'), ('cailloux', 'VERB'), ('des', 'DET'), ('chemins.', 'NOUN'), ("J'entrais", 'ADJ'), ('à', 'ADP'), ('Charleroi.', 'PROPN'), ('-', 'PUNCT'), ('Au', 'PROPN'), ('Cabaret-Vert', 'PROPN'), (':', 'PUNCT'), ('je', 'PRON'), ('demandai', 'VERB'), ('des', 'DET'), ('tartines', 'NOUN'), ('De', 'ADP'), ('beurre', 'NOUN'), ('et', 'CCONJ'), ('du', 'DET'), ('jambon', 'NOUN'), ('qui', 'PRON'), ('fût', 'AUX'), ('à', 'ADP'), ('moitié', 'NOUN'), ('froid.', 'ADJ')]
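Since the original goal was extracting nouns, the tagged pairs can then be filtered on the Universal Dependencies tags NOUN and PROPN. A small sketch, using a slice of the output above:

```python
# A few (word, tag) pairs taken from the tagger output above
res = [('Depuis', 'ADP'), ('huit', 'NUM'), ('jours,', 'NOUN'),
       ('déchiré', 'VERB'), ('bottines', 'NOUN'), ('Charleroi.', 'PROPN'),
       ('tartines', 'NOUN'), ('beurre', 'NOUN'), ('jambon', 'NOUN')]

# Keep nouns and proper nouns (UD tagset)
nouns = [word for word, tag in res if tag in ('NOUN', 'PROPN')]
print(nouns)  # ['jours,', 'bottines', 'Charleroi.', 'tartines', 'beurre', 'jambon']
```

Note that punctuation stays glued to the words ('jours,', 'Charleroi.') because the text was split on whitespace before tagging; you may want to strip it first.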

Et voilà !

snoob dogg
  • 2,491
  • 3
  • 31
  • 54
2

By default NLTK uses the English tokenizer, which will have strange or undefined behavior in French.

@fpierron is correct. If you read the article it mentions, you simply have to load the correct tokenizer language model and use it in your program.

import nltk.data
# load the French tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/french.pickle')
tokens = tokenizer.tokenize("Jadis, une nuit, je fus un papillon, voltigeant, content de son sort. Puis, je m’éveillai, étant Tchouang-tseu. Qui suis-je en réalité ? Un papillon qui rêve qu’il est Tchouang-tseu ou Tchouang qui s’imagine qu’il fut papillon ?")

print(tokens) 

['Jadis, une nuit, je fus un papillon, voltigeant, content de son sort.', 'Puis, je m’éveillai, étant Tchouang-tseu.', 'Qui suis-je en réalité ?', 'Un papillon qui rêve qu’il est Tchouang-tseu ou Tchouang qui s’imagine qu’il fut papillon ?']

If you don't have the correct file, you can use `nltk.download()` to download the correct model for French.

If you look at NLTK's website on the tokenizer, there are some other examples: http://www.nltk.org/api/nltk.tokenize.html
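Note that this Punkt tokenizer splits text into sentences, not words, as the output above shows. If all you need is a rough word-level split without downloading extra models, here is a minimal regex sketch (my own helper, not part of NLTK, and no substitute for a real tokenizer) that separates French apostrophe clitics like j' and l' while keeping hyphenated words together:

```python
import re

def rough_french_words(text):
    # Hypothetical fallback: split off apostrophe clitics (j', l', qu'),
    # keep hyphenated words intact, and isolate punctuation marks.
    return re.findall(r"\w+[’']|\w+(?:-\w+)*|[^\w\s]", text)

words = rough_french_words("J'entrais à Charleroi - Au Cabaret-Vert")
# ["J'", 'entrais', 'à', 'Charleroi', '-', 'Au', 'Cabaret-Vert']
```

For anything serious, a real French tokenizer or tagger (such as the Stanford Tagger in the other answer) will handle far more edge cases than this.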

PAP
  • 38
  • 4
Nathan McCoy
  • 3,092
  • 1
  • 24
  • 46
  • 6
    The tokens you display are not those of the sentence : you are using two different sentences "Jadis je fus un papillon voltigeant ..." and "Le courage de la goutte d'eau c'est ..." – titus Apr 10 '18 at 21:27
  • 1
    I think this tokenizer only separates sentences, it does not extract words. – Be Chiller Too Sep 05 '19 at 08:44
  • @Nathan the correct path is 'tokenizers/punkt/french.pickle', it doesn't work when I added 'PY3'. – Belkacem Thiziri Jan 19 '21 at 11:20