In short:

It has something to do with how python3 hashes keys when the similar() function uses the Counter dictionary. See http://pastebin.com/ysAF6p6h
See also How and why is the dictionary hashes different in python2 and python3?
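As a quick sanity check (this is a general property of CPython 3.3+, not anything NLTK-specific): string hashes are randomized per process in python3, so any ordering that follows those hashes can change between runs:

$ python3 -c "print(hash('foo'))"   # prints a different number on every run
$ python -c "print(hash('foo'))"    # python2: prints the same number every run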
In long:
Let's start with:
from nltk.book import *
The import here comes from https://github.com/nltk/nltk/blob/develop/nltk/book.py, which imports the nltk.text.Text object and reads several corpora into Text objects.

E.g., this is how the text1 variable is read in nltk.book:
>>> import nltk.corpus
>>> from nltk.text import Text
>>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
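With that Text object in hand, the call from the question can be reproduced directly (the repr below is what a recent NLTK prints; it may differ slightly on your version):

>>> moby
<Text: Moby Dick by Herman Melville 1851>
>>> moby.similar('monstrous')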
Now, if we go down to the code for the similar() function at https://github.com/nltk/nltk/blob/develop/nltk/text.py#L377, we see this initialization the first time self._word_context_index is accessed:
def similar(self, word, num=20):
    """
    Distributional similarity: find other words which appear in the
    same contexts as the specified word; list most similar words first.

    :param word: The word used to seed the similarity search
    :type word: str
    :param num: The number of words to generate (default=20)
    :type num: int
    :seealso: ContextIndex.similar_words()
    """
    if '_word_context_index' not in self.__dict__:
        #print('Building word-context index...')
        self._word_context_index = ContextIndex(self.tokens,
                                                filter=lambda x: x.isalpha(),
                                                key=lambda s: s.lower())

    word = word.lower()
    wci = self._word_context_index._word_to_contexts
    if word in wci.conditions():
        contexts = set(wci[word])
        fd = Counter(w for w in wci.conditions() for c in wci[w]
                     if c in contexts and not w == word)
        words = [w for w, _ in fd.most_common(num)]
        print(tokenwrap(words))
    else:
        print("No matches")
So that points us to the nltk.text.ContextIndex object, which is supposed to collect all the words with similar context windows and store them. Its docstring says:
A bidirectional index between words and their 'contexts' in a text.
The context of a word is usually defined to be the words that occur in
a fixed window around the word; but other definitions may also be used
by providing a custom context function.
By default, if you're calling the similar() function, it will initialize the _word_context_index with the default context settings, i.e. the left and right token window; see https://github.com/nltk/nltk/blob/develop/nltk/text.py#L40:
@staticmethod
def _default_context(tokens, i):
    """One left token and one right token, normalized to lowercase"""
    left = (tokens[i-1].lower() if i != 0 else '*START*')
    right = (tokens[i+1].lower() if i != len(tokens) - 1 else '*END*')
    return (left, right)
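Replicating that logic as a standalone function (same body, just lifted out of the class) shows what a "context" looks like on a toy token list:

def default_context(tokens, i):
    """One left token and one right token, normalized to lowercase."""
    left = tokens[i-1].lower() if i != 0 else '*START*'
    right = tokens[i+1].lower() if i != len(tokens) - 1 else '*END*'
    return (left, right)

tokens = ['The', 'monstrous', 'whale']
print(default_context(tokens, 0))  # ('*START*', 'monstrous')
print(default_context(tokens, 1))  # ('the', 'whale')
print(default_context(tokens, 2))  # ('monstrous', '*END*')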
From the similar() function, we see that it iterates through the words in context stored in the word-context index, i.e. wci = self._word_context_index._word_to_contexts.

Essentially, _word_to_contexts is a dictionary where the keys are the words in the corpus and the values are the left and right words, from https://github.com/nltk/nltk/blob/develop/nltk/text.py#L55:
self._word_to_contexts = CFD((self._key(w), self._context_func(tokens, i))
                             for i, w in enumerate(tokens))
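A minimal sketch of that construction, reusing the default_context stand-in from above (the toy tokens are made up):

from nltk.probability import ConditionalFreqDist

def default_context(tokens, i):
    left = tokens[i-1].lower() if i != 0 else '*START*'
    right = tokens[i+1].lower() if i != len(tokens) - 1 else '*END*'
    return (left, right)

tokens = ['The', 'monstrous', 'whale', 'and', 'the', 'monstrous', 'squid']
cfd = ConditionalFreqDist((w.lower(), default_context(tokens, i))
                          for i, w in enumerate(tokens))
print(sorted(cfd['monstrous']))  # [('the', 'squid'), ('the', 'whale')]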
And here we see that it's a CFD, i.e. an nltk.probability.ConditionalFreqDist object, which does not include any smoothing of token probability (see the full code at https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L1646).
The only possibility of getting a different result is when the similar() function loops through the most_common words at https://github.com/nltk/nltk/blob/develop/nltk/text.py#L402.

Given that two keys in a Counter object have the same counts, the word with the lower sorted hash will print out first, and the hash of a key depends on the CPU's bit-size; see http://www.laurentluce.com/posts/python-dictionary-implementation/
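To make the tie-breaking concrete: most_common() sorts by count only, so equal counts fall back to the underlying dict's iteration order, which is hash-dependent (a small sketch; the exact tie order you see depends on your Python version and hash seed):

from collections import Counter

fd = Counter({'foo': 2, 'bar': 2, 'baz': 1})
# Sorting is by count only; 'foo' and 'bar' are tied, so their relative
# order comes from the dict's (hash-dependent) iteration order.
print(fd.most_common())  # e.g. [('foo', 2), ('bar', 2), ('baz', 1)]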
The whole process of finding the similar words itself is deterministic, since:

- the corpus/input is fixed, i.e. Text(gutenberg.words('melville-moby_dick.txt'))
- the default context for every word is also fixed, i.e. self._word_context_index
- the computation of the conditional frequency distribution for _word_context_index._word_to_contexts is discrete

except when the function outputs the most_common list: whenever there's a tie in the Counter values, it outputs the tied keys ordered by their hashes.
In python2, there's no reason to get a different output from different interpreter instances on the same machine with the following code:
$ python
>>> from nltk.book import *
>>> text1.similar('monstrous')
>>> exit()
$ python
>>> from nltk.book import *
>>> text1.similar('monstrous')
>>> exit()
$ python
>>> from nltk.book import *
>>> text1.similar('monstrous')
>>> exit()
But in python3, it gives a different output every time you run text1.similar('monstrous'); see http://pastebin.com/ysAF6p6h
Here's a simple experiment to demonstrate the quirky hashing differences between python2 and python3:
alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('barfoo', 1), ('foobar', 1), ('bar', 1), ('foo', 1)]
alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foo', 1), ('barfoo', 1), ('bar', 1), ('foobar', 1)]
alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('bar', 1), ('barfoo', 1), ('foobar', 1), ('foo', 1)]
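As a follow-up (my assumption based on how CPython's hash randomization works, not something tested in the pastebin above): pinning the PYTHONHASHSEED environment variable disables the per-process hash randomization, so it should make the python3 ordering stable across runs:

alvas@ubi:~$ PYTHONHASHSEED=0 python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
alvas@ubi:~$ PYTHONHASHSEED=0 python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
# both runs now print the same ordering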