I want to tokenize a corpus of text using the NLTK library.
My corpus is a list of strings that looks like this:
['Did you hear about the Native American man that drank 200 cups of tea?',
"What's the best anti diarrheal prescription?",
'What do you call a person who is outside a door and has no arms nor legs?',
'Which Star Trek character is a member of the magic circle?',
"What's the difference between a bullet and a human?"]
I've tried:
tok_corp = [nltk.word_tokenize(sent.decode('utf-8')) for sent in corpus]
which raised:
AttributeError: 'str' object has no attribute 'decode'
Any help would be appreciated. Thanks.
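
For context on the error: in Python 3, `str` is already Unicode text and has no `decode` method; only `bytes` objects do. Below is a minimal sketch of one way to handle this, assuming the corpus may hold either `str` or `bytes` items. The helper names `to_text` and `tokenize_corpus` are hypothetical (not part of NLTK), and `str.split` stands in for the tokenizer so the snippet runs without NLTK installed.

```python
def to_text(sent, encoding="utf-8"):
    """Return sent as str, decoding only when it is a bytes object."""
    if isinstance(sent, bytes):
        return sent.decode(encoding)
    return sent  # already str in Python 3, nothing to decode


def tokenize_corpus(corpus, tokenize):
    """Apply the given tokenizer callable to every sentence in the corpus."""
    return [tokenize(to_text(sent)) for sent in corpus]


# Two sample sentences from the question; the second is bytes on purpose
# to show that both input types are handled.
corpus = [
    "What's the best anti diarrheal prescription?",
    b"Which Star Trek character is a member of the magic circle?",
]

# str.split is a whitespace-only stand-in tokenizer (an assumption for this sketch).
tok_corp = tokenize_corpus(corpus, str.split)
print(tok_corp[0])
```

With NLTK installed and its tokenizer models downloaded, passing `nltk.word_tokenize` in place of `str.split` should give NLTK's tokenization. If every item in the corpus is already a `str`, simply dropping the `.decode('utf-8')` call from the original list comprehension would likely be enough.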