Issues with sentence detection using NLTK

Asked Oct 19 '19 at 18:46

NLTK does not recognize the following as a single sentence, apparently because of the exclamation mark inside the quotation marks:

s = "Donc ce n'est pas non plus de vous dire « Allez absolument ici ! », non."

I tried:

from nltk.tokenize import sent_tokenize
sent_tokenize(s, language='french')

but I get:

["Donc ce n'est pas non plus de vous dire « Allez absolument ici !", '», non.']

Is there a better sentence detection method out there?
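
One workaround I can think of, as a minimal sketch only, is to post-process the list returned by `sent_tokenize` and glue a fragment back onto the previous one whenever that previous fragment still has an unclosed `«`. The helper name `merge_split_quotes` is hypothetical (it is not part of NLTK), and joining with a single space is an assumption about the original whitespace:

```python
def merge_split_quotes(sentences, open_q="«", close_q="»"):
    """Merge fragments that were split inside an open quotation.

    Hypothetical helper, not an NLTK function. A fragment is appended to
    the previous one while the previous one contains more opening than
    closing quote characters.
    """
    merged = []
    for sent in sentences:
        if merged and merged[-1].count(open_q) > merged[-1].count(close_q):
            # Previous fragment has an unclosed quote: rejoin with one space
            # (assumption: the original text used a single space here).
            merged[-1] = merged[-1] + " " + sent
        else:
            merged.append(sent)
    return merged


# The fragments below are the sent_tokenize output shown above.
fragments = [
    "Donc ce n'est pas non plus de vous dire « Allez absolument ici !",
    "», non.",
]
result = merge_split_quotes(fragments)
print(result)
```

The obvious drawback is that a quotation which really does contain several sentences would also be collapsed into one, so this only helps when quoted passages should stay inside the surrounding sentence.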

henry

You probably want to take a look [here](https://stackoverflow.com/questions/32003294/sentence-tokenization-for-texts-that-contains-quotes) to start with. – MyNameIsCaleb Oct 19 '19 at 19:01
I'm wondering whether it will tokenize properly on its own if you replace the `« »` characters with standard quotes? – MyNameIsCaleb Oct 19 '19 at 19:07
@MyNameIsCaleb I actually tried that. No change, unfortunately. :/ – henry Oct 19 '19 at 19:08

0 Answers