
I am new to Python and NLTK. I want to tokenize a string and add a few strings to the split list in NLTK. I used the code from the post How to tweak the NLTK sentence tokenizer. Below is the code I have written:

import nltk
from nltk.tokenize import sent_tokenize
extra_abbreviations = ['\n']
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)

sent_tokenize_list = sentence_tokenizer(document)
sent_tokenize_list

This gives me the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
      4 sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)
      5
----> 6 sent_tokenize_list = sentence_tokenizer(document)
      7 sent_tokenize_list

TypeError: 'PunktSentenceTokenizer' object is not callable

How do I fix this?

swetha
  • Hopefully, this helps: http://stackoverflow.com/a/35279885/610569 and https://github.com/alvations/DLTK/blob/master/dltk/tokenize/tokenizer.py#L49 – alvas May 09 '16 at 08:48

1 Answer


This makes your example work:

import nltk
from nltk.tokenize import sent_tokenize
extra_abbreviations = ['\n']
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)
document = """This is my test doc. It has two sentences; however, one of which has interesting punctuation."""
sent_tokenize_list = sentence_tokenizer.tokenize(document)
print(sent_tokenize_list)

Your error occurs because sentence_tokenizer is an object, not a function, so you cannot call it directly. You have to call its tokenize method instead.
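As an aside, entries in abbrev_types are expected to be lowercase abbreviations without the trailing period (e.g. 'mr' for "Mr."), so '\n' won't have the effect you may be hoping for. A minimal sketch of how an abbreviation changes the split, using an untrained PunktSentenceTokenizer so no pickle download is needed:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()
text = "Mr. Smith arrived. He sat down."

# Without the abbreviation registered, punkt may break after "Mr."
print(tokenizer.tokenize(text))

# Register "Mr." as an abbreviation: lowercase, no trailing period
tokenizer._params.abbrev_types.add('mr')
print(tokenizer.tokenize(text))  # "Mr." no longer ends a sentence
```

Note that _params is a private attribute; this pattern is common in answers on this topic but is not an official API.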

You can learn how to find out more about the capabilities of objects in the Python docs.
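For instance, the built-in dir() lists an object's attributes, which is a quick way to discover methods such as tokenize on an unfamiliar object:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()

# List the object's public attributes and methods
methods = [name for name in dir(tokenizer) if not name.startswith('_')]
print(methods)  # includes 'tokenize', among others

# help(tokenizer.tokenize) would show its signature and docstring
```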

thorsten