Repeatably hashing an arbitrary Python tuple

Question

I'm writing a specialised unit testing tool that needs to save the results of tests to be compared against in the future. Thus I need to be able to consistently map parameters that were passed to each test to the test result from running the test function with those parameters for each version. I was hoping there was a way to just hash the tuple and use that hash to name the files where I store the test results.

My first impulse was just to call hash() on the tuple of parameters, but of course that won't work since hash is randomized between interpreter instances now.

I'm having a hard time coming up with a way that works for whatever arbitrary elements might be in the tuple (I guess restricting it to a mix of ints, floats, strings, and lists/tuples of those three would be okay). Any ideas?

I've thought of using the repr of the tuple or pickling it, but repr isn't guaranteed to produce byte-for-byte identical output for the same input, and I don't think pickling is either (is it?).

I've seen this already, but the answers are all based on that same assumption, which no longer holds, and they don't really translate to this problem anyway. A lot of the discussion was about making the hash independent of the order in which items come up, whereas I do want the hash to depend on order.

I would pickle the tuple of the parameters _and_ the results into the same file whose name is the hashed tuple of parameters. That way, you should not need to worry about the randomization, because the original tuple is in the file. — DYZ, Jan 30 '18 at 02:45
@DYZ Right, I'm doing that too, but I need the tuple to be hashed repeatably to be able to find the file in the first place. — Schilcote, Jan 30 '18 at 02:47
Is [disabling the hash randomization](https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED) acceptable? — ShadowRanger, Jan 30 '18 at 02:47
@ShadowRanger That... would work, I suppose, but it's horribly inelegant and does technically mean someone can DOS my CI server with a specially crafted merge request. — Schilcote, Jan 30 '18 at 02:49
If the parameters are all str-ingifiable, use a general purpose (eg. SHA) hash? Or crib something like Super7? :} — user2864740, Jan 30 '18 at 02:49
@Schilcote: So if this is a public facing server, that's a bad idea; the question made it sound like this was just for repeatable (assumed local) unit tests. — ShadowRanger, Jan 30 '18 at 02:50
@user2864740 That _would_ work if we were only accepting numbers and strs, but arbitrary objects might not str to the same thing every time (and if repr is not overridden, by default they don't!) — Schilcote, Jan 30 '18 at 02:50
For exactly your use-case, I'd avoid using the built-in `hash` function anyway, as `hash(-1) == hash(-2)`. This also affects compound types: `hash(('a', -1)) == hash(('a', -2))`, etc. Unless you can guarantee that none of your test runs will have a parameter of -1 and another run of -2 on the same parameter, I'd avoid it. (`hash(-1) == hash(-2)` persists to at least Python version 3.8.2.) — mkoistinen, Oct 27 '20 at 14:23
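
The collision described in the last comment is easy to check. The following is a minimal sketch, assuming CPython (where -1 is reserved as an error code in the C API, so `hash(-1)` returns -2). The `digest` helper and the comparison against SHA-256 over a JSON encoding are illustrative additions, not something taken from the comment.

```python
import hashlib
import json

# In CPython, -1 and -2 share a hash value.
print(hash(-1) == hash(-2))

# A tuple's hash is built from its elements' hashes, so the collision carries over.
print(hash(("a", -1)) == hash(("a", -2)))


def digest(value):
    # Hypothetical helper: hash a JSON encoding of the value instead of using hash().
    return hashlib.sha256(json.dumps(value).encode("utf-8")).hexdigest()


# The two parameter tuples get different digests.
print(digest(("a", -1)) == digest(("a", -2)))
```

The first two comparisons are true and the last is false, which is why a digest over a serialized form is the safer choice for naming result files.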

Tom Tang · Accepted Answer · score 5 · 2018-01-30T03:55:21.393

Not sure if I understand your question fully, but will just give it a try.

Before you hash, serialize the parameters to a JSON string, then compute the hash of that string.

import hashlib
import json

params = (1, 3, 2)
hashlib.sha224(json.dumps(params).encode('utf-8')).hexdigest()  # hashlib needs bytes on Python 3
# '5f0f7a621e6f420002d54ee28b0c169b8112ef72d8a6b60e6a25171c'

If your params are a dictionary, pass sort_keys=True so that the keys are always serialized in the same order.

params = {'b': 123, 'c': 345}
hashlib.sha224(json.dumps(params, sort_keys=True).encode('utf-8')).hexdigest()
# '2e75966ce3f1185cbfb4eccc49d5552c08cfb7502a8765fe1dce9303'
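
To connect this to the file-naming use case in the question, the two snippets above can be folded into one helper. This is only a sketch under stated assumptions: `params_digest` and `result_filename` are hypothetical names, the `.json` suffix is an arbitrary choice, SHA-256 is swapped in for SHA-224, and the parameters are assumed to be JSON-serializable (ints, floats, strings, lists/tuples, dicts with string keys).

```python
import hashlib
import json


def params_digest(params):
    """Return a hex digest of params that does not depend on hash randomization.

    Hypothetical helper; assumes params holds only JSON-serializable values.
    """
    canonical = json.dumps(
        params,
        sort_keys=True,         # dict key order no longer affects the output
        separators=(",", ":"),  # pin the separators instead of relying on defaults
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def result_filename(params):
    # Hypothetical naming scheme for the stored test result.
    return params_digest(params) + ".json"


print(result_filename((1, 3, 2)))
```

Element order is preserved, so `(1, 3, 2)` and `(1, 2, 3)` map to different files. Note that JSON has no tuple type, so `(1, 3, 2)` and `[1, 3, 2]` serialize identically and share a digest.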

edited Jan 30 '18 at 03:55

answered Jan 30 '18 at 02:49

Tom Tang


Is the JSON result guaranteed to be the same every time? – Schilcote Jan 30 '18 at 02:49
@Schilcote For *lists* or *tuples* (and any stable primitives in such) it should be. – user2864740 Jan 30 '18 at 02:51
If you're serializing a tuple / list, then yes. – Tom Tang Jan 30 '18 at 02:51
And for dictionaries, you can set `sort_keys` to `True`. – DYZ Jan 30 '18 at 02:51
Oh, but this'll only work for things that the JSON library can handle natively... I guess that's still okay, but I was hoping for support for arbitrary objects. – Schilcote Jan 30 '18 at 02:53
@DYZ: A caution: `sort_keys` only works on Python 3 if the keys are homogeneous types (or otherwise have defined comparisons for all pairs of heterogeneous types, e.g. a mix of `int` and `float` is fine, but `int` and `str` is not). On Python 2, the fallback comparison allows it to (usually) work (though not necessarily repeatably, since the same-type fallback comparison is based on memory address, which isn't repeatable), but on Python 3 you'll just get a `TypeError`. – ShadowRanger Jan 30 '18 at 02:57
@ShadowRanger Yup, agree. – DYZ Jan 30 '18 at 02:58
@schilcote, if you've already serialized the results into files, then I suggest you calculate a SHA hash of each file and use that hash value as the name of the file, assuming the same test result will produce the same file (byte-for-byte comparison). – Tom Tang Jan 30 '18 at 03:05
@LiyingTang This poses the same problem of searchability: the OP will not be able to find the file that corresponds to a particular parameter set. – DYZ Jan 30 '18 at 03:09
@DYZ I thought he / she only wanted to use the file name (i.e. a hash) to do a quick comparison to test whether the results match or not... Or am I missing something here? – Tom Tang Jan 30 '18 at 03:13
@LiyingTang See the second comment to the original post. – DYZ Jan 30 '18 at 03:14
@DYZ, ah got you. The name is the parameter only, not the actual result. – Tom Tang Jan 30 '18 at 03:16
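
The `sort_keys` caution raised in the comments above can be checked directly on Python 3. A small sketch; `can_dump_sorted` is a hypothetical helper written only for this illustration.

```python
import json


def can_dump_sorted(mapping):
    """Return True if json.dumps(mapping, sort_keys=True) succeeds."""
    try:
        json.dumps(mapping, sort_keys=True)
    except TypeError:
        return False
    return True


# int and float keys are mutually comparable, so sorting works.
print(can_dump_sorted({2: "b", 1.5: "a"}))

# int and str keys cannot be ordered on Python 3, so json.dumps raises TypeError.
print(can_dump_sorted({1: "a", "b": 2}))
```

If mixed key types can occur in the parameters, one option is to normalize the keys (for example, to strings) before serializing. That is a design choice for the testing tool rather than something the `json` module does when sorting.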

score 2 · Answer 2 · answered Jan 30 '18 at 02:49

One approach for simple tests would be to disable the hash randomization entirely by setting PYTHONHASHSEED=0 in the environment that launches your script, e.g., in bash, doing:

export PYTHONHASHSEED=0
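
The effect can be observed by launching two fresh interpreters with the same seed. This is a rough sketch rather than a recommended workflow; `hash_in_fresh_interpreter` is a hypothetical helper, and it assumes Python 3.7+ for `capture_output`.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(expr, seed):
    """Evaluate hash(<expr>) in a new Python process with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    completed = subprocess.run(
        [sys.executable, "-c", f"print(hash({expr}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(completed.stdout)


# Two separate processes, same seed: the string-containing tuple hashes identically.
first = hash_in_fresh_interpreter("('a', 1, 2.5)", seed=0)
second = hash_in_fresh_interpreter("('a', 1, 2.5)", seed=0)
print(first == second)
```

A fixed seed makes the value repeatable for a given interpreter build. It should not be assumed to stay the same across Python versions or platforms, since the hashing algorithms themselves are implementation details.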

answered Jan 30 '18 at 02:49

ShadowRanger


Note: This is only for the case where the tests are local; doing it on a public facing web service would expose you to denial of service attacks (which is what hash randomization was designed to protect you from). – ShadowRanger Jan 30 '18 at 02:51
Disabling the hash randomization won't help the problem with `hash(-1) == hash(-2)`, as all integers hash to themselves, except -1, which hashes to -2 (at least as recently as Python 3.8.2). – mkoistinen Oct 27 '20 at 14:28
@mkoistinen: Sure? But that's a problem with hashing in general, and kind of irrelevant to this answer. Hash randomization is intended to remove the ability to craft colliding hashes; disabling it allows you to find colliding strings just as easily as you found colliding `int`s. `hash` of an `int` is going to potentially collide whether or not you disable it. – ShadowRanger Oct 27 '20 at 15:06
My response wasn't a criticism of your helpful post, nor did I downvote it, but rather my comment is a warning to others who attempt to use `hash()` similarly to the original question. – mkoistinen Oct 28 '20 at 00:44
