
I want to tokenize a corpus of text using NLTK library.

My corpus looks like:

['Did you hear about the Native American man that drank 200 cups of tea?',
 "What's the best anti diarrheal prescription?",
 'What do you call a person who is outside a door and has no arms nor legs?',
 'Which Star Trek character is a member of the magic circle?',
 "What's the difference between a bullet and a human?"]

I've tried:

tok_corp = [nltk.word_tokenize(sent.decode('utf-8')) for sent in corpus]

which raised:

AttributeError: 'str' object has no attribute 'decode'

Help would be appreciated. Thanks.

Stanislav Jirák
  • Possible duplicate of ['str' object has no attribute 'decode'. Python 3 error?](https://stackoverflow.com/questions/28583565/str-object-has-no-attribute-decode-python-3-error) – Bram Vanroy Aug 06 '19 at 19:33

2 Answers


The error is right there: `sent` doesn't have the attribute `decode`. You only need to `.decode()` elements if they were first encoded, i.e., if they are bytes objects instead of str objects. Remove the `.decode('utf-8')` call and it should be fine.
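If the corpus could mix bytes and str entries (say, some lines were read in binary mode), one defensive sketch is to decode only the bytes elements before tokenizing. The `to_text` helper name here is my own, not from the answer:

```python
def to_text(s):
    # Decode only bytes objects; str objects pass through unchanged.
    return s.decode("utf-8") if isinstance(s, bytes) else s

corpus = [b"What's the best anti diarrheal prescription?",
          "Which Star Trek character is a member of the magic circle?"]

# Every element is now a str, safe to hand to nltk.word_tokenize.
normalized = [to_text(s) for s in corpus]
```

With all elements normalized to str, the original comprehension works without any `.decode()` call.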

Brian
  • `TypeError: expected string or bytes-like object` To be honest, I'm just trying to reproduce code I found on the web, and I think the author is missing something there. I've tried using nltk.sent_tokenize prior to word_tokenize, but without success. – Stanislav Jirák Aug 06 '19 at 19:35
  • What is the type of the elements in `corpus` then, and does it work with the smaller sample that you posted? – Brian Aug 06 '19 at 19:41
  • It has '[' but that's a string too. It has occasionally a string, but again, as a string. – Stanislav Jirák Aug 06 '19 at 20:00

As this page suggests, the `word_tokenize` method expects a string as its argument, so just try

tok_corp = [nltk.word_tokenize(sent) for sent in corpus]

Edit: with the following code I can get the tokenized corpus,

Code:

import pandas as pd
from nltk import word_tokenize

corpus = ['Did you hear about the Native American man that drank 200 cups of tea?',
 "What's the best anti diarrheal prescription?",
 'What do you call a person who is outside a door and has no arms nor legs?',
 'Which Star Trek character is a member of the magic circle?',
 "What's the difference between a bullet and a human?"]


tok_corp = pd.DataFrame([word_tokenize(sent) for sent in corpus])

Output:

      0     1     2           3        4   ...    13    14    15    16    17
0    Did   you  hear       about      the  ...   tea     ?  None  None  None
1   What    's   the        best     anti  ...  None  None  None  None  None
2   What    do   you        call        a  ...    no  arms   nor  legs     ?
3  Which  Star  Trek   character       is  ...  None  None  None  None  None
4   What    's   the  difference  between  ...  None  None  None  None  None

I think some non-string or non-bytes-like objects have sneaked into your corpus. I recommend you check it again.
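One way to check, as a sketch: list the index and type of every element that is not a str, so you can see exactly what sneaked in (the sample corpus below is deliberately polluted for illustration):

```python
corpus = ["What's the best anti diarrheal prescription?",
          ['a', 'nested', 'list'],  # deliberately not a string
          42]                       # deliberately not a string

# Report index and type name of every non-string element.
bad = [(i, type(x).__name__) for i, x in enumerate(corpus)
       if not isinstance(x, str)]
print(bad)  # -> [(1, 'list'), (2, 'int')]
```

Any element reported here would raise `TypeError: expected string or bytes-like object` when passed to `word_tokenize`.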

null
  • `TypeError: expected string or bytes-like object` To be honest, I'm just trying to reproduce code I found on the web, and I think the author is missing something there. I've tried using nltk.sent_tokenize prior to word_tokenize, but without success. – Stanislav Jirák Aug 06 '19 at 19:35
  • is it possible that corpus which is a list might be containing some item other than string ? It would be great if you can debug the loop tbh. – null Aug 06 '19 at 19:57
  • It has '[' but that's a string too. It has occasionally a string, but again, as a string. – Stanislav Jirák Aug 06 '19 at 19:59