
I'm having an issue understanding why NLTK's word_tokenize looks at the string "this's" and splits it into "this" and "'s" instead of keeping them together. I've tested with "test's" and that works fine. When I tested with "results'" it split off the apostrophe again. Is this just something that always happens with apostrophes?

1 Answer


It is normal behavior for NLTK, and tokenizers in general, to split this's into this + 's, because 's is a clitic and the two are separate syntactic units.

>>> from nltk import word_tokenize
>>> word_tokenize("this's")
['this', "'s"]

For the case of results' it's the same:

>>> word_tokenize("results'")
['results', "'"]

Why are 's and ' treated as entities separate from their host word?

In the case of this's, 's is an abbreviated form of is, i.e. it denotes the copula. In some cases it's ambiguous and can also denote a possessive.

And in the second case, results', the ' denotes a possessive.

So if we POS tag the tokenized forms we get:

>>> from nltk import word_tokenize, pos_tag
>>> pos_tag(word_tokenize("results'"))
[('results', 'NNS'), ("'", 'POS')]

In the case of this's, the POS tagger thinks 's is a possessive, because people seldom write this's in text:

>>> from nltk import word_tokenize, pos_tag
>>> pos_tag(word_tokenize("this's"))
[('this', 'DT'), ("'s", 'POS')]

But if we look at He's -> He + 's, it's clearer that 's is denoting the copula:

>>> pos_tag(word_tokenize("He's good."))
[('He', 'PRP'), ("'s", 'VBZ'), ('good', 'JJ'), ('.', '.')]

Related question: https://stackoverflow.com/a/47384013/610569

alvas