
I have the name of a tennis player that apparently is not parsed correctly by pandas: "radwa?ska". I tokenize it with word_tokenize() in two different functions, and the two functions tokenize it differently. How do I get the second function to produce the same result as the first?
First function: ['radwa?ska']
Second function: ['radwa','?','ska']
Here is the call in each function.
First function:
word_tokenize(keyword), where keyword is 'martina hingis, nadia petrova, agnieszka radwa?ska'
Second function:
word_tokenize(content[j]), where content[j] is 'agnieszka radwa?ska'

Vikramark

1 Answer


Both sentences in your original post should produce the same tokenization for the name:

>>> from nltk import word_tokenize
>>> text1 = 'martina hingis, nadia petrova, agnieszka radwa?ska'
>>> text2 = 'agnieszka radwa?ska'

>>> word_tokenize(text1)
['martina', 'hingis', ',', 'nadia', 'petrova', ',', 'agnieszka', 'radwa', '?', 'ska']

>>> word_tokenize(text2)
['agnieszka', 'radwa', '?', 'ska']

This happens because the TreebankWordTokenizer object behind word_tokenize always puts a space before and after a question mark, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L80


One can hack around that regex to disable it, but doing so will likely cause problems somewhere else down the road.
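If you only need the name kept in one piece while you sort out the data, a separate, much cruder tokenizer avoids touching NLTK at all. The sketch below is not what word_tokenize does: simple_tokenize is a hypothetical helper that splits on whitespace and treats commas as their own tokens, nothing more.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: runs of non-space, non-comma characters become
    # tokens, and each comma becomes its own token. A '?' inside a word is
    # left alone, unlike in NLTK's Treebank-style tokenizer.
    return re.findall(r"[^\s,]+|,", text)

tokens_long = simple_tokenize('martina hingis, nadia petrova, agnieszka radwa?ska')
tokens_short = simple_tokenize('agnieszka radwa?ska')

print(tokens_long)
print(tokens_short)
```

This keeps 'radwa?ska' together, but it also keeps any other punctuation glued to its word, which is the kind of downstream problem mentioned above.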

But I urge you to take a closer look at how you're reading or collecting your data. The fact that radwa?ska appears hints at an encoding problem upstream, before tokenization. Reading the file/stream correctly would have given you radwańska.
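To illustrate how a '?' can end up in the name, here is a minimal sketch using only built-in string methods. It assumes the damage came from encoding to a charset that has no 'ń' with replacement enabled; your actual pipeline may lose the character in a different way.

```python
name = "radwańska"

# 'ń' (U+0144) exists in neither ASCII nor Latin-1, so encoding with
# errors="replace" substitutes a literal question mark.
lossy_ascii = name.encode("ascii", errors="replace").decode("ascii")
lossy_latin1 = name.encode("latin-1", errors="replace").decode("latin-1")

# UTF-8 can represent the character, so the round trip is lossless.
roundtrip_utf8 = name.encode("utf-8").decode("utf-8")

print(lossy_ascii)
print(lossy_latin1)
print(roundtrip_utf8)
```

Once the character has been replaced, the original cannot be recovered from the string, which is why the fix belongs at the point where the data is read or written.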

alvas
    I checked upstream and found that the dataframe does contain `radwańska`. When I write it to CSV using `pandas.DataFrame.to_csv(path)`, the name gets messed up. I then tried the `utf-8-sig` encoding, which wrote the CSV correctly, but I later got a `unicode-error` when I tried `pandas.read_csv(path)`. I tried various combinations of `utf-8`, `utf-8-sig` and `latin1`, but none worked. – Vikramark Dec 06 '17 at 11:41
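
One thing worth checking for the round trip described in the comment is whether the same encoding is passed on both the write and the read. The sketch below assumes Python 3 and a reasonably recent pandas; the column name and the temporary file path are made up for illustration, and it may not reproduce the exact error seen above.

```python
import os
import tempfile

import pandas as pd

# Illustrative data; the column name "player" is an assumption.
df = pd.DataFrame(
    {"player": ["martina hingis", "nadia petrova", "agnieszka radwańska"]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "players.csv")
    # Write and read with the same explicit encoding.
    df.to_csv(path, index=False, encoding="utf-8-sig")
    restored = pd.read_csv(path, encoding="utf-8-sig")

print(restored["player"].tolist())
```

If the name survives here but not in the real pipeline, the mismatch is probably in another step that opens the file without an explicit encoding.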