
I am using NLTK to tokenize Wikipedia articles into sentences, but the punkt tokenizer is not giving very good results. Sometimes it splits a sentence when an abbreviation such as etc. appears, and it also has problems when double quotation marks appear in the text, producing output like ['as they say "harry is a good boy.', '" He thinks'], and so on.
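For reference, here is a minimal sketch of the kind of call I am making; the sample sentence is made up, and whether punkt actually splits it depends on the trained model:

from nltk.tokenize import sent_tokenize

text = 'The article lists fruits, vegetables, etc. as healthy foods.'
# With some punkt models the abbreviation "etc." is mistaken for a
# sentence boundary, splitting the text in the middle.
print(sent_tokenize(text))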

I want to stick to NLTK itself as this is something which is sandwiched between some other processes.

Are there any other classifiers that can be used?

I don't mind using any other library in python as well.

Amrith Krishna
  • What about just splitting the text using `.` as a separator? And of course, handling the `etc.` case yourself. Perhaps a regular expression? – Raydel Miranda Sep 17 '14 at 15:25
  • I have tried that, but that again gives more trouble. – Amrith Krishna Sep 17 '14 at 15:26
  • Stanford CoreNLP has a sentence segmenter and tokenizer: http://stackoverflow.com/questions/9492707/how-can-i-split-a-text-into-sentences-using-the-stanford-parser/9493264#9493264 It's not in Python, but there are Python interfaces. – dmcc Sep 17 '14 at 18:42

1 Answer


Try splitting the text with regular expressions; you can use a negative lookbehind assertion:

import re
# This is the Lorem ipsum text, modified a little to match your requirements.
# Note the following:
# 1. - "et dolore magna"        --> the presence of `"`
# 2. - Sunt, in culpa, etc. qui --> the presence of `etc.`
text = """Lorem ipsum dolor sit amet. Consectetur adipisicing elit, sed do eiusmod 
tempor incididunt ut labore "et dolore magna" aliqua. Ut enim ad minim veniam. Cillum dolore 
proident. Sunt, in culpa, etc. qui officia deserunt mollit anim id est laborum."""

# Use a negative lookbehind assertion to split the text on any period
# `.` that is not preceded by `etc`.
sentences = re.split(r"(?<!etc)\.", text)
# Then keep only the word characters, dropping punctuation and extra whitespace.
sentences = [" ".join(re.findall(r"\w+", sentence)) for sentence in sentences]
# Finally,
print(sentences)
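Running this on the sample text prints something close to the following (the final empty string comes from the period at the very end of the text):

['Lorem ipsum dolor sit amet', 'Consectetur adipisicing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua', 'Ut enim ad minim veniam', 'Cillum dolore proident', 'Sunt in culpa etc qui officia deserunt mollit anim id est laborum', '']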

Of course, all this is more convenient wrapped in a function you can reuse whenever you want:

def get_sentences(text):
    sentences = re.split(r"(?<!etc)\.", text)
    return [" ".join(re.findall(r"\w+", sentence)) for sentence in sentences]

# Example of use.
print(get_sentences(text))

IMPORTANT

If you find another exception besides etc., let's say NLTK., you can add it to the splitter pattern like this:

...
# Python's `re` lookbehind must be fixed-width, so chain one lookbehind
# per abbreviation instead of putting alternation inside a single one.
sentences = re.split(r"(?<!etc)(?<!NLTK)\.", text)
...
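If the list of exceptions grows, it is easier to build the pattern from a list. Here is a small sketch of that idea (the helper name `make_splitter` is mine, not part of any library):

import re

def make_splitter(abbreviations):
    # Chain one fixed-width lookbehind per abbreviation; Python's `re`
    # module rejects variable-width alternation inside a lookbehind.
    lookbehinds = "".join("(?<!%s)" % re.escape(abbr) for abbr in abbreviations)
    return re.compile(lookbehinds + r"\.")

# Example of use.
splitter = make_splitter(["etc", "NLTK"])
print(splitter.split(text))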

References:

Regular Expression HOWTO

Regular expression operations

Raydel Miranda