
Update:

Despite the rigorous cleaning, some words with periods are still being tokenized with the periods intact, including strings where there is a space between the period and the quotation mark. I've created a public link to a Jupyter Notebook with an example of the problem here: https://drive.google.com/file/d/0B90qb2J7ZLYrZmItME5RRlhsVWM/view?usp=sharing

Or a shorter example:

word_tokenize('This is a test. "')
['This', 'is', 'a', 'test.', '``']

But the problem disappears when the other type of double quote is used:

word_tokenize('This is a test. ”')
['This', 'is', 'a', 'test', '.', '”']
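
One blunt workaround for this particular case is to post-process the tokens and strip a stray trailing period before stemming. A minimal sketch (the clean_tokens helper is just an illustration, not part of the original pipeline), relying on the tokenizer output shown above:

from nltk import word_tokenize

def clean_tokens(tokens):
    # Strip a stray trailing period from multi-character tokens;
    # a lone '.' token is kept so sentence boundaries survive.
    return [t.rstrip('.') if len(t) > 1 else t for t in tokens]

clean_tokens(word_tokenize('This is a test. "'))
# ['This', 'is', 'a', 'test', '``']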

Original:

I'm stemming a large corpus of text, and I created a counter to see the counts of each word; I then transferred that counter to a dataframe for easier handling. Each piece of text is a large string of between 100 and 5,000 words. The dataframe with the word counts looks like this, taking words with a count of exactly 11, for instance:

allwordsdf[(allwordsdf['count'] == 11)]


        words          count
551     throughlin     11
1921    rampd          11
1956    pinhol         11
2476    reckhow        11

What I've noticed is that there are a lot of words that weren't fully stemmed, and they have periods attached to the end. For instance:

4233    activist.   11
9243    storyline.  11

I'm not sure what accounts for this. I know the pipeline typically splits periods off as separate tokens, because the period row stands at:

23  .   5702880

Also, it seems like it's not doing it for every instance of, say, 'activist.':

len(articles[articles['content'].str.contains('activist.')])
9600
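
(A caveat about that count: pandas' str.contains treats its pattern as a regular expression by default, so the unescaped period matches any character; 'activists', for example, would also match. Escaping the period restricts the search to a literal period:)

len(articles[articles['content'].str.contains(r'activist\.')])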

Not sure if I'm overlooking something. Yesterday I ran into a problem with the NLTK stemmer that turned out to be a bug, and I don't know if this is something similar or something I'm doing (always more likely).

Thanks for any guidance.

Edit:

Here's the function I'm using:

import sys
import time

from nltk import word_tokenize

# `stemmer` is defined elsewhere in the pipeline (an NLTK stemmer such as PorterStemmer).

progress = 0
start = time.time()

def stem(x):
    # Tokenize one document, report rough progress, and stem each token.
    end = time.time()
    tokens = word_tokenize(x)
    global start
    global progress
    progress += 1
    # Note: progress / len(articles) is a fraction (0-1), not a percentage.
    sys.stdout.write('\r {} percent, {} position, {} per second '.format(
        str(float(progress / len(articles))),
        str(progress), (1 / (end - start))))

    stems = [stemmer.stem(e) for e in tokens]
    start = time.time()
    return stems


articles['stems'] = articles.content.apply(lambda x: stem(x))

Edit 2:

Here is a JSON file with some of the data: all the strings, tokens, and stems.

And this is a snippet of what I get when I look for all the words, after tokenizing and stemming, that still have periods:

allwordsdf[allwordsdf['words'].str.contains('\.')] #dataframe made from the counter dict

      words       count
23    .           5702875
63    years.      1231
497   was.        281
798   lost.       157
817   jie.        1
819   teacher.    24
858   domains.    1
875   fallout.    3
884   net.        23
889   option.     89
895   step.       67
927   pool.       30
936   that.       4245
954   compute.    2
1001  dr.         11007
1010  decisions.  159

The length of that slice comes out to about 49,000.

Edit 3:

Alvas's answer helped cut down the number of words with periods by about half, to 24,000 unique words and a total count of 518,980, which is still a lot. The problem, as I discovered, is that it happens EVERY time there's a period followed by a quotation mark. For instance, take the string 'sickened.', which appears once in the tokenized words.

If I search the corpus:

articles[articles['content'].str.contains(r'sickened\.[^\s]')]

The only place in the entire corpus it shows up is here:

...said he was “sickened.” Trump's running mate...

This is not an isolated incident; it's what I've seen over and over while searching for these terms. They have a quotation mark after them every time. So the tokenizer isn't just mishandling character-period-quotation-character sequences, but character-period-quotation-whitespace as well.
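
One pre-processing trick that handles both patterns is the space-padding approach used in the second answer below: put a space on either side of every period (and other sentence punctuation) before tokenizing, so that a character-period-quotation sequence can't survive as a single token. A rough sketch (the example string is abbreviated from the quote above):

import re

s = 'said he was “sickened.” Trump’s running mate'
clean = re.sub('([.,!?()])', r' \1 ', s)
# 'said he was “sickened . ” Trump’s running mate'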

snapcrack
  • Tokenize before stemming. Can you please put up the code and some sample data so that we can help you better? – alvas Jul 13 '17 at 23:05
  • Of course, my apologies. I'll put it up now. – snapcrack Jul 13 '17 at 23:08
  • Can you put example where the tokenizer get it wrong? – titipata Jul 13 '17 at 23:31
  • @titipata There's the example up top with `activist.` I tried looking for the strings in which that sub-string appears, but it returned thousands of results, meaning that it's difficult to determine where exactly it's going wrong. – snapcrack Jul 14 '17 at 00:02
  • Can you share a sample of your data? Just 5 rows will do. – alvas Jul 14 '17 at 04:49
  • The challenge with just sharing five rows is that I have no idea if those rows will reproduce the problem. It's a few needles in a haystack and I can't even find where in the df the needles are coming from. – snapcrack Jul 14 '17 at 05:57
  • BTW, which NLTK version are you using? `python -c "import nltk; print(nltk.__version__)"` – alvas Jul 14 '17 at 06:10
  • 3.2.4 (more characters here to fulfill comment minimum) – snapcrack Jul 14 '17 at 06:14
  • The real TL;DNR answer is: The stemmer is working correctly; you need to improve the tokenization (either by fixing the scraping, as alvas suggested, or simply by post-processing each token to remove stray punctuation before stemming). – alexis Jul 14 '17 at 13:07
  • Agrees with @alexis – alvas Jul 15 '17 at 02:51
  • @alexis This: 'This building is a symbol of a house not made with hands wherein shall dwell the spirit of truth, justice, and comradeship. "', with a space placed before the quotation mark, was tokenized/stemmed as `comradeship.` I'm not sure how to explain this other than by something being wrong. Note that it only happens when I tokenize/stem lots and lots of words; simply putting the lone string in the tokenizer/stemmer doesn't produce the same result. – snapcrack Jul 15 '17 at 07:36
  • See update up top for link to the notebook – snapcrack Jul 15 '17 at 08:06
  • You do realize that tokenization and stemming are separate steps in your code, don't you? So "how to explain this" is clear: Your text was **tokenized** wrong, and your example with a trailing quote shows that nicely. The reason you care is because it interferes with stemming, ok. But the solution is straightforward: Remove stray punctuation from each token before you stem it. – alexis Jul 17 '17 at 22:14

2 Answers


You need to tokenize the string before stemming:

>>> from nltk.stem import PorterStemmer
>>> from nltk import word_tokenize
>>> text = 'This is a foo bar sentence, that contains punctuations.'
>>> porter = PorterStemmer()
>>> [porter.stem(word) for word in text.split()]
[u'thi', 'is', 'a', 'foo', 'bar', 'sentence,', 'that', u'contain', 'punctuations.']
>>> [porter.stem(word) for word in word_tokenize(text)]
[u'thi', 'is', 'a', 'foo', 'bar', u'sentenc', ',', 'that', u'contain', u'punctuat', '.']

In a dataframe:

porter = PorterStemmer()
articles['tokens'] = articles['content'].apply(word_tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])

>>> import pandas as pd
>>> from nltk.stem import PorterStemmer
>>> from nltk import word_tokenize
>>> sents = ['This is a foo bar, sentence.', 'Yet another, foo bar!']
>>> df = pd.DataFrame(sents, columns=['content'])
>>> df
                        content
0  This is a foo bar, sentence.
1         Yet another, foo bar!

# Apply tokenizer.
>>> df['tokens'] = df['content'].apply(word_tokenize)
>>> df
                        content                                   tokens
0  This is a foo bar, sentence.  [This, is, a, foo, bar, ,, sentence, .]
1         Yet another, foo bar!           [Yet, another, ,, foo, bar, !]

# Without DataFrame.apply
>>> df['tokens'][0]
['This', 'is', 'a', 'foo', 'bar', ',', 'sentence', '.']
>>> [porter.stem(word) for word in df['tokens'][0]]
[u'thi', 'is', 'a', 'foo', 'bar', ',', u'sentenc', '.']

# With DataFrame.apply
>>> df['tokens'].apply(lambda row: [porter.stem(word) for word in row])
0    [thi, is, a, foo, bar, ,, sentenc, .]
1             [yet, anoth, ,, foo, bar, !]

# Or if you like nested lambdas.
>>> df['tokens'].apply(lambda x: map(lambda y: porter.stem(y), x))
0    [thi, is, a, foo, bar, ,, sentenc, .]
1             [yet, anoth, ,, foo, bar, !]
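
(A note on the last variant: in Python 3, map returns a lazy iterator rather than a list, so you would need to wrap it, e.g.:)

df['tokens'].apply(lambda x: list(map(porter.stem, x)))
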
alvas
  • This is why I should have put the function up in the original post (sloppy of me not to). That seems like it's what I'm already doing unless I'm overlooking something? – snapcrack Jul 13 '17 at 23:13
  • The vectorized functions ended up stemming every letter. Not sure why there was such a different result than with the loop. In the meantime, upvoting for your help/effort – snapcrack Jul 14 '17 at 00:01
  • My bad, missed out the inner loop when rushing to commute, see updated answer. – alvas Jul 14 '17 at 01:33
  • Took me awhile to respond because these functions take a really long time, but it's the same problem: the period is left in some strings, and so the word isn't stemmed. I'm wondering if this is an nltk problem, since I can't imagine what's causing it. – snapcrack Jul 14 '17 at 04:48
  • Did you try `df['tokens'].apply(lambda row: [porter.stem(word) for word in row])`? The output is `[thi, is, a, foo, bar, ,, sentenc, .]` – alvas Jul 14 '17 at 04:48
  • Yup. And 99% of the time it seems to do just that. It's these unusual cases that stack up. I'll edit to demonstrate the magnitude and add some data. – snapcrack Jul 14 '17 at 04:53
  • Can you please upload your dataset somewhere? Otherwise we're just shooting in the dark. The chances are higher that your data isn't clean, or that another part of your code is causing the problem, than that it's the Porter stemmer function ;P – alvas Jul 14 '17 at 05:03
  • Sure. It'll take a few minutes to upload since it's sitting at about 2.9 gb right now, but I'll edit my post when it's up. – snapcrack Jul 14 '17 at 05:06
  • 2.9GB can't possibly be a pure text file... Compress the file and upload =) – alvas Jul 14 '17 at 05:08
  • JSON with lots of stuff ;o. You're right, let me see if I can cut this down – snapcrack Jul 14 '17 at 05:09
  • JSON is in the edit. Let me know if there are problems. Thanks very much for your efforts on this – snapcrack Jul 14 '17 at 05:52

The code from the answer above works for clean text:

porter = PorterStemmer()
sents = ['This is a foo bar, sentence.', 'Yet another, foo bar!']
articles = pd.DataFrame(sents, columns=['content'])
articles['tokens'] = articles['content'].apply(word_tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])

Looking at the JSON file, you have very dirty data. Most probably, when you scraped the text from the website, you didn't put spaces between the <p>...</p> tags or the sections you were extracting, which leads to chunks of text like:

“So [now] AlphaGo actually learns from its own searches to improve its neural networks, both the policy network and the value network, and this makes it learn in a much more general way. One of the things we’re most excited about is not just that it can play Go better but we hope that this’ll actually lead to technologies that are more generally applicable to other challenging domains.”AlphaGo is comprised of two networks: a policy network that selects the next move to play, and a value network that analyzes the probability of winning. The policy network was initially based on millions of historical moves from actual games played by Go professionals. But AlphaGo Master goes much further by searching through the possible moves that could occur if a particular move is played, increasing its understanding of the potential fallout.“The original system played against itself millions of times, but it didn’t have this component of using the search,” Hassabis tells The Verge. “[AlphaGo Master is] using its own strength to improve its own predictions. So whereas in the previous version it was mostly about generating data, in this version it’s actually using the power of its own search function and its own abilities to improve one part of itself, the policy net.”

Note that there are many instances where a quotation mark directly follows a full stop and runs into the next word, e.g. domains.”AlphaGo.

And if you try to use the default NLTK word_tokenize function on this, you will get domains., ”, and AlphaGo as separate tokens, i.e.

>>> from nltk import word_tokenize

>>> text = u"""“So [now] AlphaGo actually learns from its own searches to improve its neural networks, both the policy network and the value network, and this makes it learn in a much more general way. One of the things we’re most excited about is not just that it can play Go better but we hope that this’ll actually lead to technologies that are more generally applicable to other challenging domains.”AlphaGo is comprised of two networks: a policy network that selects the next move to play, and a value network that analyzes the probability of winning. The policy network was initially based on millions of historical moves from actual games played by Go professionals. But AlphaGo Master goes much further by searching through the possible moves that could occur if a particular move is played, increasing its understanding of the potential fallout.“The original system played against itself millions of times, but it didn’t have this component of using the search,” Hassabis tells The Verge. “[AlphaGo Master is] using its own strength to improve its own predictions. So whereas in the previous version it was mostly about generating data, in this version it’s actually using the power of its own search function and its own abilities to improve one part of itself, the policy net.”"""

>>> word_tokenize(text)
[u'\u201c', u'So', u'[', u'now', u']', u'AlphaGo', u'actually', u'learns', u'from', u'its', u'own', u'searches', u'to', u'improve', u'its', u'neural', u'networks', u',', u'both', u'the', u'policy', u'network', u'and', u'the', u'value', u'network', u',', u'and', u'this', u'makes', u'it', u'learn', u'in', u'a', u'much', u'more', u'general', u'way', u'.', u'One', u'of', u'the', u'things', u'we', u'\u2019', u're', u'most', u'excited', u'about', u'is', u'not', u'just', u'that', u'it', u'can', u'play', u'Go', u'better', u'but', u'we', u'hope', u'that', u'this', u'\u2019', u'll', u'actually', u'lead', u'to', u'technologies', u'that', u'are', u'more', u'generally', u'applicable', u'to', u'other', u'challenging', u'domains.', u'\u201d', u'AlphaGo', u'is', u'comprised', u'of', u'two', u'networks', u':', u'a', u'policy', u'network', u'that', u'selects', u'the', u'next', u'move', u'to', u'play', u',', u'and', u'a', u'value', u'network', u'that', u'analyzes', u'the', u'probability', u'of', u'winning', u'.', u'The', u'policy', u'network', u'was', u'initially', u'based', u'on', u'millions', u'of', u'historical', u'moves', u'from', u'actual', u'games', u'played', u'by', u'Go', u'professionals', u'.', u'But', u'AlphaGo', u'Master', u'goes', u'much', u'further', u'by', u'searching', u'through', u'the', u'possible', u'moves', u'that', u'could', u'occur', u'if', u'a', u'particular', u'move', u'is', u'played', u',', u'increasing', u'its', u'understanding', u'of', u'the', u'potential', u'fallout.', u'\u201c', u'The', u'original', u'system', u'played', u'against', u'itself', u'millions', u'of', u'times', u',', u'but', u'it', u'didn', u'\u2019', u't', u'have', u'this', u'component', u'of', u'using', u'the', u'search', u',', u'\u201d', u'Hassabis', u'tells', u'The', u'Verge', u'.', u'\u201c', u'[', u'AlphaGo', u'Master', u'is', u']', u'using', u'its', u'own', u'strength', u'to', u'improve', u'its', u'own', u'predictions', u'.', u'So', u'whereas', u'in', u'the', u'previous', u'version', u'it', u'was', u'mostly', u'about', u'generating', u'data', u',', u'in', u'this', u'version', u'it', u'\u2019', u's', u'actually', u'using', u'the', u'power', u'of', u'its', u'own', u'search', u'function', u'and', u'its', u'own', u'abilities', u'to', u'improve', u'one', u'part', u'of', u'itself', u',', u'the', u'policy', u'net', u'.', u'\u201d']

>>> 'domains.' in word_tokenize(text)
True

So there are several ways to resolve this; here are a couple:

  • Try cleaning up your data before feeding it to the word_tokenize function, e.g. by padding spaces around punctuation first

  • Try a different tokenizer, e.g. MosesTokenizer

Padding spaces around punctuation first

>>> import re
>>> clean_text = re.sub('([.,!?()])', r' \1 ', text)
>>> word_tokenize(clean_text)
[u'\u201c', u'So', u'[', u'now', u']', u'AlphaGo', u'actually', u'learns', u'from', u'its', u'own', u'searches', u'to', u'improve', u'its', u'neural', u'networks', u',', u'both', u'the', u'policy', u'network', u'and', u'the', u'value', u'network', u',', u'and', u'this', u'makes', u'it', u'learn', u'in', u'a', u'much', u'more', u'general', u'way', u'.', u'One', u'of', u'the', u'things', u'we', u'\u2019', u're', u'most', u'excited', u'about', u'is', u'not', u'just', u'that', u'it', u'can', u'play', u'Go', u'better', u'but', u'we', u'hope', u'that', u'this', u'\u2019', u'll', u'actually', u'lead', u'to', u'technologies', u'that', u'are', u'more', u'generally', u'applicable', u'to', u'other', u'challenging', u'domains', u'.', u'\u201d', u'AlphaGo', u'is', u'comprised', u'of', u'two', u'networks', u':', u'a', u'policy', u'network', u'that', u'selects', u'the', u'next', u'move', u'to', u'play', u',', u'and', u'a', u'value', u'network', u'that', u'analyzes', u'the', u'probability', u'of', u'winning', u'.', u'The', u'policy', u'network', u'was', u'initially', u'based', u'on', u'millions', u'of', u'historical', u'moves', u'from', u'actual', u'games', u'played', u'by', u'Go', u'professionals', u'.', u'But', u'AlphaGo', u'Master', u'goes', u'much', u'further', u'by', u'searching', u'through', u'the', u'possible', u'moves', u'that', u'could', u'occur', u'if', u'a', u'particular', u'move', u'is', u'played', u',', u'increasing', u'its', u'understanding', u'of', u'the', u'potential', u'fallout', u'.', u'\u201c', u'The', u'original', u'system', u'played', u'against', u'itself', u'millions', u'of', u'times', u',', u'but', u'it', u'didn', u'\u2019', u't', u'have', u'this', u'component', u'of', u'using', u'the', u'search', u',', u'\u201d', u'Hassabis', u'tells', u'The', u'Verge', u'.', u'\u201c', u'[', u'AlphaGo', u'Master', u'is', u']', u'using', u'its', u'own', u'strength', u'to', u'improve', u'its', u'own', u'predictions', u'.', u'So', u'whereas', u'in', u'the', u'previous', u'version', u'it', u'was', u'mostly', u'about', u'generating', u'data', u',', u'in', u'this', u'version', u'it', u'\u2019', u's', u'actually', u'using', u'the', u'power', u'of', u'its', u'own', u'search', u'function', u'and', u'its', u'own', u'abilities', u'to', u'improve', u'one', u'part', u'of', u'itself', u',', u'the', u'policy', u'net', u'.', u'\u201d']
>>> 'domains.' in word_tokenize(clean_text)
False

Using MosesTokenizer:

>>> from nltk.tokenize.moses import MosesTokenizer
>>> mo = MosesTokenizer()
>>> mo.tokenize(text)
[u'\u201c', u'So', u'&#91;', u'now', u'&#93;', u'AlphaGo', u'actually', u'learns', u'from', u'its', u'own', u'searches', u'to', u'improve', u'its', u'neural', u'networks', u',', u'both', u'the', u'policy', u'network', u'and', u'the', u'value', u'network', u',', u'and', u'this', u'makes', u'it', u'learn', u'in', u'a', u'much', u'more', u'general', u'way', u'.', u'One', u'of', u'the', u'things', u'we', u'\u2019', u're', u'most', u'excited', u'about', u'is', u'not', u'just', u'that', u'it', u'can', u'play', u'Go', u'better', u'but', u'we', u'hope', u'that', u'this', u'\u2019', u'll', u'actually', u'lead', u'to', u'technologies', u'that', u'are', u'more', u'generally', u'applicable', u'to', u'other', u'challenging', u'domains', u'.', u'\u201d', u'AlphaGo', u'is', u'comprised', u'of', u'two', u'networks', u':', u'a', u'policy', u'network', u'that', u'selects', u'the', u'next', u'move', u'to', u'play', u',', u'and', u'a', u'value', u'network', u'that', u'analyzes', u'the', u'probability', u'of', u'winning', u'.', u'The', u'policy', u'network', u'was', u'initially', u'based', u'on', u'millions', u'of', u'historical', u'moves', u'from', u'actual', u'games', u'played', u'by', u'Go', u'professionals', u'.', u'But', u'AlphaGo', u'Master', u'goes', u'much', u'further', u'by', u'searching', u'through', u'the', u'possible', u'moves', u'that', u'could', u'occur', u'if', u'a', u'particular', u'move', u'is', u'played', u',', u'increasing', u'its', u'understanding', u'of', u'the', u'potential', u'fallout', u'.', u'\u201c', u'The', u'original', u'system', u'played', u'against', u'itself', u'millions', u'of', u'times', u',', u'but', u'it', u'didn', u'\u2019', u't', u'have', u'this', u'component', u'of', u'using', u'the', u'search', u',', u'\u201d', u'Hassabis', u'tells', u'The', u'Verge', u'.', u'\u201c', u'&#91;', u'AlphaGo', u'Master', u'is', u'&#93;', u'using', u'its', u'own', u'strength', u'to', u'improve', u'its', u'own', u'predictions', u'.', u'So', u'whereas', u'in', u'the', u'previous', u'version', u'it', u'was', u'mostly', u'about', u'generating', u'data', u',', u'in', u'this', u'version', u'it', u'\u2019', u's', u'actually', u'using', u'the', u'power', u'of', u'its', u'own', u'search', u'function', u'and', u'its', u'own', u'abilities', u'to', u'improve', u'one', u'part', u'of', u'itself', u',', u'the', u'policy', u'net', u'.', u'\u201d']
>>> 'domains.' in mo.tokenize(text)
False
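
(Note for later NLTK versions: the nltk.tokenize.moses module was eventually removed from NLTK and the same tokenizer now lives in the separate sacremoses package, so an equivalent, assuming sacremoses is installed, would be:)

from sacremoses import MosesTokenizer
mt = MosesTokenizer(lang='en')
mt.tokenize(text)  # returns a list of tokens, like the NLTK wrapper above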

TL;DR

Use:

from nltk.tokenize.moses import MosesTokenizer
mo = MosesTokenizer()
articles['tokens'] = articles['content'].apply(mo.tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])

Or:

articles['clean'] = articles['content'].apply(lambda x: re.sub('([.,!?()])', r' \1 ', x))
articles['tokens'] = articles['clean'].apply(word_tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])
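
For completeness, a self-contained sketch of the second option with the necessary imports (the one-row dataframe and its sample sentence are just illustrations):

import re

import pandas as pd
from nltk import word_tokenize
from nltk.stem import PorterStemmer

porter = PorterStemmer()
articles = pd.DataFrame(['He was “sickened.”His running mate went on.'], columns=['content'])

# Pad spaces around sentence punctuation so the tokenizer can split it off cleanly.
articles['clean'] = articles['content'].apply(lambda x: re.sub('([.,!?()])', r' \1 ', x))
articles['tokens'] = articles['clean'].apply(word_tokenize)
articles['stem'] = articles['tokens'].apply(lambda toks: [porter.stem(t) for t in toks])
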
alvas
  • I have a total cleaning pipeline meant to address these very issues and I'm wondering if it simply didn't clean thoroughly enough or if I opened up an uncleaned version of the file. (For example, that JSON is pre-cleaning). Let me see if going through the cleaning again, thoroughly, fixes this. In the meantime, another upvote for your really rigorous help – snapcrack Jul 14 '17 at 06:37
  • Fix the data from the start, not in between. **Don't** clean it after scraping. Simply pad each text section with spaces **when scraping**. And **don't remove non-breaking spaces**; replace them with spaces (a sketch of this is below). – alvas Jul 14 '17 at 06:38
  • The replace-with-spaces part I have in my cleaning process. The clean-while-scraping happened, but insufficiently. But point taken ;o – snapcrack Jul 14 '17 at 06:40
  • Hope it helps. But since this is not exactly a coding question, and more of a data management/scraping/cleaning/ingesting question, don't mind if I close the question to avoid confusion, esp. if there's no real breakage in the stemmer ;P – alvas Jul 14 '17 at 06:41
  • BTW, we're monitoring breakage on the stemmer function in NLTK, just in case, thus time/effort spent on the question =) – alvas Jul 14 '17 at 06:43
  • Had to run through it to check, but it looks like this was indeed the problem. I genuinely thought I'd addressed this very issue. But I think it's a useful question/answer, since it's easy to overlook this stuff and I imagine this isn't completely uncommon. Thanks again for all your help. – snapcrack Jul 14 '17 at 19:52
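
As referenced in the comment above, a minimal sketch of what "pad each text section with spaces when scraping" could look like with BeautifulSoup (the HTML snippet and variable names are just illustrations):

from bs4 import BeautifulSoup

html = '<p>challenging domains.</p><p>“AlphaGo is comprised of two networks...</p>'
soup = BeautifulSoup(html, 'html.parser')

# get_text(' ') inserts a space between text from adjacent tags instead of
# concatenating them, and non-breaking spaces are replaced rather than removed.
text = soup.get_text(' ').replace(u'\xa0', ' ')
# 'challenging domains. “AlphaGo is comprised of two networks...'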