22

I'm trying to write a script to calculate all of the possible fuzzy string match matches to for a short string, or 'kmer', and the same code that works in Python 2.7.X gives me a non-deterministic answer with Python 3.3.X, and I can't figure out why.

I iterate over a dictionary, itertools.product, and itertools.combinations in my code, but I iterate over all of them to completion with no breaks or continues. In addition, I store all of my results in a separate dictionary instead of the one I'm iterating over. In short - I'm not making any mistakes that are obvious to me, so why is the behavior different between Python2 and Python3?

Sample, slightly simplified code below:

import itertools

def find_best_fuzzy_kmer( kmers ):
    for kmer, value in kmers.items():
        for similar_kmer in permute_string( kmer, m ):
            # Tabulate Kmer

def permute_string( query, m ):
    query_list = list(query)
    output = set() # hold output
    for i in range(m+1):
        # pre-calculate the possible combinations of new bases
        base_combinations = list(itertools.product('AGCT', repeat=i))
        # for each combination `idx` in idxs, replace str[idx]
        for positions in itertools.combinations(range(len(query_list)), i):
            for bases in base_combinations:
                # Generate Permutations and add to output
    return output
BioInfoBrett
  • 305
  • 1
  • 8
  • 2
    Please give a specific example of a problematic input, and the output you get under both Pythons. "non-deterministic" may be clear to you, but we have no idea what you mean ;-) – Tim Peters Nov 09 '13 at 04:11
  • 2
    My guess is that it has something to do with dictionaries and sets. A dictionary is a set of key:value pairs. A set is unordered, which means that you can't rely on the order of the objects in your implementation. What I'm thinking is that python 3 implements these two things in a different way than python 2. If your implementation relies on the order of either a dict or a set, you should not be using them. – Stephen Nov 09 '13 at 04:29
  • 1
    "I'm not making any mistakes that are obvious to me" Now you know: iterating over a dictionary and assuming it has *any* guarantee on the order of the iterations *is* an obvious mistake. If you want guarantees on the order use an `OrderedDict` or just `list(the_dict.items())` (which, by the way, uses less memory than `OrderedDict`). – Bakuriu Nov 09 '13 at 09:55
  • @Bakuriu, note that `list(the_dict.items())` does nothing to impose order; of course `sorted(the_dict.items())` does. – Tim Peters Nov 09 '13 at 18:19
  • @TimPeters Yes, I meant `sorted`. I don't know why I wrote a plain `list` instead... – Bakuriu Nov 10 '13 at 08:31

1 Answers1

35

If by "non-deterministic" you mean the order in which dictionary keys appear (when you iterate over a dictionary) changes from run to run, and the dictionary keys are strings, please say so. Then I can help. But so far you haven't said any of that ;-)

Assuming that's the problem, here's a little program:

d = dict((L, i) for i, L in enumerate('abcd'))
print(d)

and the output from 4 runs under Python 3.3.2:

{'d': 3, 'a': 0, 'c': 2, 'b': 1}
{'d': 3, 'b': 1, 'c': 2, 'a': 0}
{'d': 3, 'a': 0, 'b': 1, 'c': 2}
{'a': 0, 'b': 1, 'c': 2, 'd': 3}

The cause is hinted at from this part of python -h output:

Other environment variables:
...
PYTHONHASHSEED: if this variable is set to 'random', a random value is used
   to seed the hashes of str, bytes and datetime objects.  It can also be
   set to an integer in the range [0,4294967295] to get hash values with a
   predictable seed.

This is a half-baked "security fix", intended to help prevent DOS attacks based on constructing dict inputs designed to provoke quadratic-time behavior. "random" is the default in Python3.

You can turn that off by setting the envar PYTHONHASHSEED to an integer (your choice - pick 0 if you don't care). Then iterating a dict with string keys will produce them in the same order across runs.

As @AlcariTheMad said in a comment, you can enable the Python3 default behavior under Python 2 via python -R ....

Tim Peters
  • 67,464
  • 13
  • 126
  • 132
  • 5
    @thefourtheye, "control" is mostly illusory - there's still nothing *defined* about dict iteration order. All this trick supplies is consistent cross-run order provided a dict is constructed in exactly the same way across runs (same sequence of key insertions and deletions). – Tim Peters Nov 09 '13 at 04:29
  • 2
    And, it probably goes without saying, but hash randomization also applies to `set` objects. – mgilson Nov 09 '13 at 04:36
  • 1
    And Python 3.4 finishes baking the remote DOS fix by switching to a crypto based hash :) – ncoghlan Dec 26 '13 at 13:50