The code from the answer above works for clean text:
import pandas as pd
from nltk import word_tokenize
from nltk.stem import PorterStemmer

porter = PorterStemmer()
sents = ['This is a foo bar, sentence.', 'Yet another, foo bar!']
articles = pd.DataFrame(sents, columns=['content'])
articles['tokens'] = articles['content'].apply(word_tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])
Looking at the JSON file, you have very dirty data. Most probably, when you scraped the text from the website, you didn't put spaces between the <p>...</p> tags or sections you were extracting, which leads to chunks of text like:
“So [now] AlphaGo actually learns from its own searches to improve its
neural networks, both the policy network and the value network, and
this makes it learn in a much more general way. One of the things
we’re most excited about is not just that it can play Go better but we
hope that this’ll actually lead to technologies that are more
generally applicable to other challenging domains.”AlphaGo is
comprised of two networks: a policy network that selects the next move
to play, and a value network that analyzes the probability of winning.
The policy network was initially based on millions of historical moves
from actual games played by Go professionals. But AlphaGo Master goes
much further by searching through the possible moves that could occur
if a particular move is played, increasing its understanding of the
potential fallout.“The original system played against itself millions
of times, but it didn’t have this component of using the search,”
Hassabis tells The Verge. “[AlphaGo Master is] using its own strength
to improve its own predictions. So whereas in the previous version it
was mostly about generating data, in this version it’s actually using
the power of its own search function and its own abilities to improve
one part of itself, the policy net.”
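For context, that kind of fused text is usually the result of joining the text of consecutive <p> tags with no separator. A minimal sketch of the difference, using Python 3's stdlib html.parser (the HTML snippet and the TextExtractor helper are hypothetical, just to reproduce the symptom):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes and join them with a configurable separator."""
    def __init__(self, separator=''):
        super().__init__()
        self.separator = separator
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)
    def get_text(self):
        return self.separator.join(self.chunks)

html = u'<p>challenging domains.\u201d</p><p>AlphaGo is comprised of two networks.</p>'

# No separator: the two paragraphs fuse exactly like the scraped text above
fused = TextExtractor()
fused.feed(html)
print(fused.get_text())   # ...domains.”AlphaGo...

# A single-space separator keeps the paragraph boundary intact
spaced = TextExtractor(separator=' ')
spaced.feed(html)
print(spaced.get_text())  # ...domains.” AlphaGo...
```

If you control the scraper, fixing the extraction this way is cheaper than repairing the text afterwards.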
Note that there are many instances where a closing quote directly follows a full stop with no space in between, e.g. domains.”AlphaGo.
And if you try to use the default NLTK word_tokenize function on this, you will get the tokens domains., ” and AlphaGo; i.e.
>>> from nltk import word_tokenize
>>> text = u"""“So [now] AlphaGo actually learns from its own searches to improve its neural networks, both the policy network and the value network, and this makes it learn in a much more general way. One of the things we’re most excited about is not just that it can play Go better but we hope that this’ll actually lead to technologies that are more generally applicable to other challenging domains.”AlphaGo is comprised of two networks: a policy network that selects the next move to play, and a value network that analyzes the probability of winning. The policy network was initially based on millions of historical moves from actual games played by Go professionals. But AlphaGo Master goes much further by searching through the possible moves that could occur if a particular move is played, increasing its understanding of the potential fallout.“The original system played against itself millions of times, but it didn’t have this component of using the search,” Hassabis tells The Verge. “[AlphaGo Master is] using its own strength to improve its own predictions. So whereas in the previous version it was mostly about generating data, in this version it’s actually using the power of its own search function and its own abilities to improve one part of itself, the policy net.”"""
>>> word_tokenize(text)
[u'\u201c', u'So', u'[', u'now', u']', u'AlphaGo', u'actually', u'learns', u'from', u'its', u'own', u'searches', u'to', u'improve', u'its', u'neural', u'networks', u',', u'both', u'the', u'policy', u'network', u'and', u'the', u'value', u'network', u',', u'and', u'this', u'makes', u'it', u'learn', u'in', u'a', u'much', u'more', u'general', u'way', u'.', u'One', u'of', u'the', u'things', u'we', u'\u2019', u're', u'most', u'excited', u'about', u'is', u'not', u'just', u'that', u'it', u'can', u'play', u'Go', u'better', u'but', u'we', u'hope', u'that', u'this', u'\u2019', u'll', u'actually', u'lead', u'to', u'technologies', u'that', u'are', u'more', u'generally', u'applicable', u'to', u'other', u'challenging', u'domains.', u'\u201d', u'AlphaGo', u'is', u'comprised', u'of', u'two', u'networks', u':', u'a', u'policy', u'network', u'that', u'selects', u'the', u'next', u'move', u'to', u'play', u',', u'and', u'a', u'value', u'network', u'that', u'analyzes', u'the', u'probability', u'of', u'winning', u'.', u'The', u'policy', u'network', u'was', u'initially', u'based', u'on', u'millions', u'of', u'historical', u'moves', u'from', u'actual', u'games', u'played', u'by', u'Go', u'professionals', u'.', u'But', u'AlphaGo', u'Master', u'goes', u'much', u'further', u'by', u'searching', u'through', u'the', u'possible', u'moves', u'that', u'could', u'occur', u'if', u'a', u'particular', u'move', u'is', u'played', u',', u'increasing', u'its', u'understanding', u'of', u'the', u'potential', u'fallout.', u'\u201c', u'The', u'original', u'system', u'played', u'against', u'itself', u'millions', u'of', u'times', u',', u'but', u'it', u'didn', u'\u2019', u't', u'have', u'this', u'component', u'of', u'using', u'the', u'search', u',', u'\u201d', u'Hassabis', u'tells', u'The', u'Verge', u'.', u'\u201c', u'[', u'AlphaGo', u'Master', u'is', u']', u'using', u'its', u'own', u'strength', u'to', u'improve', u'its', u'own', u'predictions', u'.', u'So', u'whereas', u'in', u'the', u'previous', u'version', 
u'it', u'was', u'mostly', u'about', u'generating', u'data', u',', u'in', u'this', u'version', u'it', u'\u2019', u's', u'actually', u'using', u'the', u'power', u'of', u'its', u'own', u'search', u'function', u'and', u'its', u'own', u'abilities', u'to', u'improve', u'one', u'part', u'of', u'itself', u',', u'the', u'policy', u'net', u'.', u'\u201d']
>>> 'domains.' in word_tokenize(text)
True
So there are several ways to resolve this; here are a couple:

- Clean up your data before feeding it to the word_tokenize function, e.g. by padding spaces around punctuation first
- Try a different tokenizer, e.g. MosesTokenizer
Padding spaces around punctuation first:
>>> import re
>>> clean_text = re.sub('([.,!?()])', r' \1 ', text)
>>> word_tokenize(clean_text)
[u'\u201c', u'So', u'[', u'now', u']', u'AlphaGo', u'actually', u'learns', u'from', u'its', u'own', u'searches', u'to', u'improve', u'its', u'neural', u'networks', u',', u'both', u'the', u'policy', u'network', u'and', u'the', u'value', u'network', u',', u'and', u'this', u'makes', u'it', u'learn', u'in', u'a', u'much', u'more', u'general', u'way', u'.', u'One', u'of', u'the', u'things', u'we', u'\u2019', u're', u'most', u'excited', u'about', u'is', u'not', u'just', u'that', u'it', u'can', u'play', u'Go', u'better', u'but', u'we', u'hope', u'that', u'this', u'\u2019', u'll', u'actually', u'lead', u'to', u'technologies', u'that', u'are', u'more', u'generally', u'applicable', u'to', u'other', u'challenging', u'domains', u'.', u'\u201d', u'AlphaGo', u'is', u'comprised', u'of', u'two', u'networks', u':', u'a', u'policy', u'network', u'that', u'selects', u'the', u'next', u'move', u'to', u'play', u',', u'and', u'a', u'value', u'network', u'that', u'analyzes', u'the', u'probability', u'of', u'winning', u'.', u'The', u'policy', u'network', u'was', u'initially', u'based', u'on', u'millions', u'of', u'historical', u'moves', u'from', u'actual', u'games', u'played', u'by', u'Go', u'professionals', u'.', u'But', u'AlphaGo', u'Master', u'goes', u'much', u'further', u'by', u'searching', u'through', u'the', u'possible', u'moves', u'that', u'could', u'occur', u'if', u'a', u'particular', u'move', u'is', u'played', u',', u'increasing', u'its', u'understanding', u'of', u'the', u'potential', u'fallout', u'.', u'\u201c', u'The', u'original', u'system', u'played', u'against', u'itself', u'millions', u'of', u'times', u',', u'but', u'it', u'didn', u'\u2019', u't', u'have', u'this', u'component', u'of', u'using', u'the', u'search', u',', u'\u201d', u'Hassabis', u'tells', u'The', u'Verge', u'.', u'\u201c', u'[', u'AlphaGo', u'Master', u'is', u']', u'using', u'its', u'own', u'strength', u'to', u'improve', u'its', u'own', u'predictions', u'.', u'So', u'whereas', u'in', u'the', u'previous', 
u'version', u'it', u'was', u'mostly', u'about', u'generating', u'data', u',', u'in', u'this', u'version', u'it', u'\u2019', u's', u'actually', u'using', u'the', u'power', u'of', u'its', u'own', u'search', u'function', u'and', u'its', u'own', u'abilities', u'to', u'improve', u'one', u'part', u'of', u'itself', u',', u'the', u'policy', u'net', u'.', u'\u201d']
>>> 'domains.' in word_tokenize(clean_text)
False
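Note that the character class above only pads ASCII punctuation; the curly quotes come out attached to their neighbours in the cleaned string, and it is word_tokenize that happens to split them off. If you want the padding itself to handle them, one possible extension of the regex (a sketch, not the only way):

```python
import re

text = u'challenging domains.\u201dAlphaGo is comprised'
# Add the curly quotes (U+201C, U+201D) to the character class as well,
# so the glued period and quote are both padded with spaces
clean = re.sub(u'([.,!?()\u201c\u201d])', r' \1 ', text)
print(clean.split())  # a plain split() is now enough to separate the tokens
```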
Using MosesTokenizer:
>>> from nltk.tokenize.moses import MosesTokenizer
>>> mo = MosesTokenizer()
>>> mo.tokenize(text)
[u'\u201c', u'So', u'[', u'now', u']', u'AlphaGo', u'actually', u'learns', u'from', u'its', u'own', u'searches', u'to', u'improve', u'its', u'neural', u'networks', u',', u'both', u'the', u'policy', u'network', u'and', u'the', u'value', u'network', u',', u'and', u'this', u'makes', u'it', u'learn', u'in', u'a', u'much', u'more', u'general', u'way', u'.', u'One', u'of', u'the', u'things', u'we', u'\u2019', u're', u'most', u'excited', u'about', u'is', u'not', u'just', u'that', u'it', u'can', u'play', u'Go', u'better', u'but', u'we', u'hope', u'that', u'this', u'\u2019', u'll', u'actually', u'lead', u'to', u'technologies', u'that', u'are', u'more', u'generally', u'applicable', u'to', u'other', u'challenging', u'domains', u'.', u'\u201d', u'AlphaGo', u'is', u'comprised', u'of', u'two', u'networks', u':', u'a', u'policy', u'network', u'that', u'selects', u'the', u'next', u'move', u'to', u'play', u',', u'and', u'a', u'value', u'network', u'that', u'analyzes', u'the', u'probability', u'of', u'winning', u'.', u'The', u'policy', u'network', u'was', u'initially', u'based', u'on', u'millions', u'of', u'historical', u'moves', u'from', u'actual', u'games', u'played', u'by', u'Go', u'professionals', u'.', u'But', u'AlphaGo', u'Master', u'goes', u'much', u'further', u'by', u'searching', u'through', u'the', u'possible', u'moves', u'that', u'could', u'occur', u'if', u'a', u'particular', u'move', u'is', u'played', u',', u'increasing', u'its', u'understanding', u'of', u'the', u'potential', u'fallout', u'.', u'\u201c', u'The', u'original', u'system', u'played', u'against', u'itself', u'millions', u'of', u'times', u',', u'but', u'it', u'didn', u'\u2019', u't', u'have', u'this', u'component', u'of', u'using', u'the', u'search', u',', u'\u201d', u'Hassabis', u'tells', u'The', u'Verge', u'.', u'\u201c', u'[', u'AlphaGo', u'Master', u'is', u']', u'using', u'its', u'own', u'strength', u'to', u'improve', u'its', u'own', u'predictions', u'.', u'So', u'whereas', u'in', u'the', u'previous', 
u'version', u'it', u'was', u'mostly', u'about', u'generating', u'data', u',', u'in', u'this', u'version', u'it', u'\u2019', u's', u'actually', u'using', u'the', u'power', u'of', u'its', u'own', u'search', u'function', u'and', u'its', u'own', u'abilities', u'to', u'improve', u'one', u'part', u'of', u'itself', u',', u'the', u'policy', u'net', u'.', u'\u201d']
>>> 'domains.' in mo.tokenize(text)
False
TL;DR
Use:
from nltk.tokenize.moses import MosesTokenizer
mo = MosesTokenizer()
articles['tokens'] = articles['content'].apply(mo.tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])
Or:
import re
articles['clean'] = articles['content'].apply(lambda x: re.sub('([.,!?()])', r' \1 ', x))
articles['tokens'] = articles['clean'].apply(word_tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])