I'm having trouble understanding why NLTK's `word_tokenize` takes the string "this's" and splits it into "this" and "'s" instead of keeping them together. I've tested with "test's" and this works fine. When I tested with "results'" it split off the apostrophe again. Is this just something that will always happen with apostrophes?
- I think this's (heh!) relevant: https://ell.stackexchange.com/q/145503 – Fred Larson Nov 21 '17 at 16:14
- Have you tried adding \ before it, i.e. `'this\'s'`? – Xantium Nov 21 '17 at 17:05
- @Simon, I tried it and it didn't work – Darpan Ganatra Nov 22 '17 at 21:59
- All good, just parsed it myself – Darpan Ganatra Nov 22 '17 at 22:01
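As the comments above suggest, adding a backslash cannot change the result: in a Python string literal, the backslash in `'this\'s'` only escapes the quote for the parser, so the resulting string object is identical to `"this's"` and the tokenizer receives the same input either way. A quick stdlib-only check (no NLTK needed):

```python
# The backslash in 'this\'s' escapes the quote for the Python parser only;
# the string object that reaches any tokenizer is identical to "this's".
escaped = 'this\'s'
plain = "this's"

print(escaped == plain)  # True: the two literals produce the same string
print(len(escaped))      # 6: t, h, i, s, ', s — no backslash survives
```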
1 Answer
It is normal behavior for NLTK, and for tokenizers in general, to split `this's` into `this` + `'s`, because `'s` is a clitic and the two are separate syntactic units.
>>> from nltk import word_tokenize
>>> word_tokenize("this's")
['this', "'s"]
For the case of `results'` it's the same:
>>> word_tokenize("results'")
['results', "'"]
Why are the `'s` and `'` separate entities from their hosts?
In the case of `this's`, the `'s` is an abbreviated form of "is", which denotes the copula. In some cases it's ambiguous, since `'s` can also denote the possessive.
And in the 2nd case, `results'`, the `'` denotes the possessive.
So if we POS tag the tokenized forms we get:
>>> from nltk import word_tokenize, pos_tag
>>> pos_tag(word_tokenize("results'"))
[('results', 'NNS'), ("'", 'POS')]
In the case of `this's`, the POS tagger thinks the `'s` is a possessive, because people seldom use `this's` in written text:
>>> from nltk import word_tokenize, pos_tag
>>> pos_tag(word_tokenize("this's"))
[('this', 'DT'), ("'s", 'POS')]
But if we look at `He's` -> `He` + `'s`, it's clearer that the `'s` denotes the copula:
>>> pos_tag(word_tokenize("He's good."))
[('He', 'PRP'), ("'s", 'VBZ'), ('good', 'JJ'), ('.', '.')]
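Under the hood, `word_tokenize` delegates this splitting to NLTK's Treebank tokenizer, which peels clitics off with regular expressions. A rough, simplified sketch of the idea in plain `re` (a single pattern, not NLTK's actual rule set):

```python
import re

# Simplified illustration of Treebank-style clitic splitting; NLTK's real
# tokenizer uses a larger set of rules than this single pattern.
CLITIC = re.compile(r"^(.+?)('s|n't|'re|'ve|'ll|'d|'m|')$", re.IGNORECASE)

def split_clitics(word):
    m = CLITIC.match(word)
    if m:
        return [m.group(1), m.group(2)]  # host + clitic as two tokens
    return [word]                        # no clitic found: leave it whole

print(split_clitics("this's"))    # ['this', "'s"]
print(split_clitics("results'"))  # ['results', "'"]
print(split_clitics("good"))      # ['good']
```

This also explains why `test's` and `results'` behave the same way: the split happens on the surface form alone, and only the later POS-tagging step tries to decide what the clitic means.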
Related question: https://stackoverflow.com/a/47384013/610569

alvas