NLTK does not recognize the following as a single sentence; it splits at the exclamation mark inside the quotation marks.

s = "Donc ce n'est pas non plus de vous dire « Allez absolument ici ! », non."

I tried:

from nltk.tokenize import sent_tokenize
sent_tokenize(s, language='french')

but I get:

["Donc ce n'est pas non plus de vous dire « Allez absolument ici !", '», non.']

Is there a sentence detection method that handles quoted punctuation like this correctly?

henry
  • You probably want to take a look [here](https://stackoverflow.com/questions/32003294/sentence-tokenization-for-texts-that-contains-quotes) to start with. – MyNameIsCaleb Oct 19 '19 at 19:01
  • I'm wondering if you replace the `« »` characters with standard quotes if it will tokenize properly on its own? – MyNameIsCaleb Oct 19 '19 at 19:07
  • @MyNameIsCaleb I actually tried that. No change, unfortunately. :/ – henry Oct 19 '19 at 19:08
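
One possible workaround, given that swapping the quote characters did not help, is to post-process the list that `sent_tokenize` returns and glue a fragment back onto the previous one whenever the previous one still has an unclosed `«`. The sketch below is plain Python and works on the list shown in the question; `merge_unclosed_quotes` is a hypothetical helper name, not part of NLTK, and rejoining with a single space is an assumption about the original spacing.

```python
def merge_unclosed_quotes(sentences, open_q="«", close_q="»"):
    """Rejoin fragments that a sentence splitter cut inside a « ... » quotation.

    Hypothetical helper (not an NLTK function). Assumes quotes are not nested
    and that fragments were originally separated by a single space.
    """
    merged = []
    for fragment in sentences:
        # If the last collected sentence has more opening than closing quotes,
        # the split happened inside a quotation: append this fragment to it.
        if merged and merged[-1].count(open_q) > merged[-1].count(close_q):
            merged[-1] = merged[-1] + " " + fragment
        else:
            merged.append(fragment)
    return merged


# The fragments reported in the question, as returned by sent_tokenize.
fragments = [
    "Donc ce n'est pas non plus de vous dire « Allez absolument ici !",
    "», non.",
]
print(merge_unclosed_quotes(fragments))
```

In practice this would be applied as `merge_unclosed_quotes(sent_tokenize(s, language='french'))`. Note that it keeps everything between `«` and `»` in one sentence, so a long quotation containing several sentences will also come back as a single item.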
