nltk word_tokenize() tokenizes the same word two different ways in different functions

Question

I have the name of a tennis player that apparently is not read correctly by pandas: it comes out as "radwa?ska". I call word_tokenize() in two different functions, and it tokenizes the name differently in each. How do I get the second function to produce the first result?
['radwa?ska']
['radwa','?','ska']
Here is the call in each function.
First function:
word_tokenize(keyword), where keyword is 'martina hingis, nadia petrova, agnieszka radwa?ska'
Second function:
word_tokenize(content[j]), where content[j] is 'agnieszka radwa?ska'

score 1 · Answer 1 · answered Dec 06 '17 at 11:19

Both sentences in your original post should be tokenized the same way:

>>> from nltk import word_tokenize
>>> text1 = 'martina hingis, nadia petrova, agnieszka radwa?ska'
>>> text2 = 'agnieszka radwa?ska'

>>> word_tokenize(text1)
['martina', 'hingis', ',', 'nadia', 'petrova', ',', 'agnieszka', 'radwa', '?', 'ska']

>>> word_tokenize(text2)
['agnieszka', 'radwa', '?', 'ska']

That's because the TreebankWordTokenizer object behind word_tokenize always puts a space before and after a question mark; see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L80
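To see why padding the question mark splits the name, here is a minimal sketch of that kind of rule using only the standard library. It is a simplified stand-in, not NLTK's actual code: the pattern and the helper name split_like_treebank are my own assumptions for illustration.

```python
import re

# Simplified stand-in for the punctuation rule: match every "?" or "!".
# Illustration only; this is NOT the real TreebankWordTokenizer.
QUESTION_OR_BANG = re.compile(r"[?!]")

def split_like_treebank(text):
    """Pad '?' and '!' with spaces, then split on whitespace."""
    padded = QUESTION_OR_BANG.sub(r" \g<0> ", text)
    return padded.split()

tokens = split_like_treebank("agnieszka radwa?ska")
print(tokens)
```

Once the "?" is surrounded by spaces, the word falls apart into three tokens no matter which function makes the call. A name containing the real letter ń would pass through this rule untouched.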

It is possible to hack around that regex and disable it, but doing so will likely cause problems somewhere else down the road.

But I urge you to take a closer look at how you're reading/collecting your data. The fact that radwa?ska appears at all hints that an encoding problem exists upstream, before tokenization. Reading the file/stream correctly would have given you radwańska.
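I can't tell from the post where the "?" was introduced, but one plausible way is a lossy encode somewhere in the pipeline. The sketch below is only an illustration under that assumption: the file name players.txt is hypothetical, and UTF-8 is assumed as the real encoding of your data.

```python
import os
import tempfile

name = "agnieszka radwa\u0144ska"  # "\u0144" is the letter ń

# Encoding to a charset that lacks ń with errors="replace" swaps it for "?".
lossy = name.encode("ascii", errors="replace").decode("ascii")

# Writing and reading with the same explicit encoding keeps the letter intact.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "players.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(name)
    with open(path, encoding="utf-8") as fh:
        restored = fh.read()

print(lossy)
print(restored == name)
```

If your data goes through pandas, the analogous step is to pass the file's actual encoding explicitly when reading it, rather than relying on a default.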
