In short:

It has something to do with how python3 hashes keys when the similar() function uses the Counter dictionary. See http://pastebin.com/ysAF6p6h
See also How and why is the dictionary hashes different in python2 and python3?
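As a quick sanity check (this is a general property of CPython 3.3+, not anything NLTK-specific): string hashes are randomized per process in python3, so any ordering that follows those hashes can change between runs:

$ python3 -c "print(hash('foo'))"   # prints a different number on every run
$ python -c "print(hash('foo'))"    # python2: prints the same number every run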
In long:
Let's start with:
from nltk.book import *
The import here comes from https://github.com/nltk/nltk/blob/develop/nltk/book.py, which imports the nltk.text.Text object and reads several corpora into Text objects.

E.g., this is how the text1 variable is read in nltk.book:
>>> import nltk.corpus
>>> from nltk.text import Text
>>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
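With that Text object in hand, the call from the question can be reproduced directly (the repr below is what a recent NLTK prints; it may differ slightly on your version):

>>> moby
<Text: Moby Dick by Herman Melville 1851>
>>> moby.similar('monstrous')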
Now, if we go down to the code for the similar() function at https://github.com/nltk/nltk/blob/develop/nltk/text.py#L377, we see this initialization the first time self._word_context_index is accessed:
def similar(self, word, num=20):
    """
    Distributional similarity: find other words which appear in the
    same contexts as the specified word; list most similar words first.

    :param word: The word used to seed the similarity search
    :type word: str
    :param num: The number of words to generate (default=20)
    :type num: int
    :seealso: ContextIndex.similar_words()
    """
    if '_word_context_index' not in self.__dict__:
        #print('Building word-context index...')
        self._word_context_index = ContextIndex(self.tokens,
                                                filter=lambda x: x.isalpha(),
                                                key=lambda s: s.lower())

    word = word.lower()
    wci = self._word_context_index._word_to_contexts
    if word in wci.conditions():
        contexts = set(wci[word])
        fd = Counter(w for w in wci.conditions() for c in wci[w]
                     if c in contexts and not w == word)
        words = [w for w, _ in fd.most_common(num)]
        print(tokenwrap(words))
    else:
        print("No matches")
So that points us to the nltk.text.ContextIndex object, which is supposed to collect all the words with similar context windows and store them. Its docstring says:
A bidirectional index between words and their 'contexts' in a text.
The context of a word is usually defined to be the words that occur in
a fixed window around the word; but other definitions may also be used
by providing a custom context function.
By default, if you're calling the similar() function, it will initialize the _word_context_index with the default context settings, i.e. the left and right token window; see https://github.com/nltk/nltk/blob/develop/nltk/text.py#L40:
@staticmethod
def _default_context(tokens, i):
    """One left token and one right token, normalized to lowercase"""
    left = (tokens[i-1].lower() if i != 0 else '*START*')
    right = (tokens[i+1].lower() if i != len(tokens) - 1 else '*END*')
    return (left, right)
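Replicating that logic as a standalone function (same body, just lifted out of the class) shows what a "context" looks like on a toy token list:

def default_context(tokens, i):
    """One left token and one right token, normalized to lowercase."""
    left = tokens[i-1].lower() if i != 0 else '*START*'
    right = tokens[i+1].lower() if i != len(tokens) - 1 else '*END*'
    return (left, right)

tokens = ['The', 'monstrous', 'whale']
print(default_context(tokens, 0))  # ('*START*', 'monstrous')
print(default_context(tokens, 1))  # ('the', 'whale')
print(default_context(tokens, 2))  # ('monstrous', '*END*')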
From the similar() function, we see that it iterates through the words in context stored in the word-context index, i.e. wci = self._word_context_index._word_to_contexts.

Essentially, _word_to_contexts is a dictionary where the keys are the words in the corpus and the values are the left and right words, from https://github.com/nltk/nltk/blob/develop/nltk/text.py#L55:
self._word_to_contexts = CFD((self._key(w), self._context_func(tokens, i))
                             for i, w in enumerate(tokens))
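A minimal sketch of that construction, reusing the default_context stand-in from above (the toy tokens are made up):

from nltk.probability import ConditionalFreqDist

def default_context(tokens, i):
    left = tokens[i-1].lower() if i != 0 else '*START*'
    right = tokens[i+1].lower() if i != len(tokens) - 1 else '*END*'
    return (left, right)

tokens = ['The', 'monstrous', 'whale', 'and', 'the', 'monstrous', 'squid']
cfd = ConditionalFreqDist((w.lower(), default_context(tokens, i))
                          for i, w in enumerate(tokens))
print(sorted(cfd['monstrous']))  # [('the', 'squid'), ('the', 'whale')]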
And here we see that it's a CFD, i.e. an nltk.probability.ConditionalFreqDist object, which does not include any smoothing of token probability (see the full code at https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L1646).
The only possibility of getting a different result is when the similar() function loops through the most_common words at https://github.com/nltk/nltk/blob/develop/nltk/text.py#L402.

Given that two keys in a Counter object have the same counts, the word with the lower sorted hash will print out first, and the hash of a key depends on the CPU's bit-size; see http://www.laurentluce.com/posts/python-dictionary-implementation/
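To make the tie-breaking concrete: most_common() sorts by count only, so equal counts fall back to the underlying dict's iteration order, which is hash-dependent (a small sketch; the exact tie order you see depends on your Python version and hash seed):

from collections import Counter

fd = Counter({'foo': 2, 'bar': 2, 'baz': 1})
# Sorting is by count only; 'foo' and 'bar' are tied, so their relative
# order comes from the dict's (hash-dependent) iteration order.
print(fd.most_common())  # e.g. [('foo', 2), ('bar', 2), ('baz', 1)]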
The whole process of finding the similar words itself is deterministic, since:

- the corpus/input is fixed, i.e. Text(gutenberg.words('melville-moby_dick.txt'))
- the default context for every word is also fixed, i.e. self._word_context_index
- the computation of the conditional frequency distribution for _word_context_index._word_to_contexts is discrete

except when the function outputs the most_common list: whenever there's a tie in the Counter values, it outputs the tied keys ordered by their hashes.
In python2, there's no reason to get a different output from different interpreter instances on the same machine with the following code:
$ python
>>> from nltk.book import *
>>> text1.similar('monstrous')
>>> exit()
$ python
>>> from nltk.book import *
>>> text1.similar('monstrous')
>>> exit()
$ python
>>> from nltk.book import *
>>> text1.similar('monstrous')
>>> exit()
But in python3, it gives a different output every time you run text1.similar('monstrous'); see http://pastebin.com/ysAF6p6h
Here's a simple experiment to demonstrate the quirky hashing differences between python2 and python3:
alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('barfoo', 1), ('foobar', 1), ('bar', 1), ('foo', 1)]
alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foo', 1), ('barfoo', 1), ('bar', 1), ('foobar', 1)]
alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('bar', 1), ('barfoo', 1), ('foobar', 1), ('foo', 1)]
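As a follow-up (my assumption based on how CPython's hash randomization works, not something tested in the pastebin above): pinning the PYTHONHASHSEED environment variable disables the per-process hash randomization, so it should make the python3 ordering stable across runs:

alvas@ubi:~$ PYTHONHASHSEED=0 python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
alvas@ubi:~$ PYTHONHASHSEED=0 python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
# both runs now print the same ordering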