
The hashlib Python module provides the following hash algorithm constructors: md5(), sha1(), sha224(), sha256(), sha384(), and sha512().

Assuming I don't want to use md5, is there a big difference in using, say, sha1 instead of sha512? I want to use something like hashlib.shaXXX(hashString).hexdigest(), but as it's just for caching, I'm not sure I need the (possible) extra overhead of 512...

Does this overhead exist, and if so, how big is it?

Emilien
  • If it is for caching, why do you need a secure hash? – President James K. Polk Feb 10 '10 at 23:12
  • When you tried the different methods and measured their performance, what did you find? – Greg Hewgill Feb 10 '10 at 23:14
  • What @GregHewgill probably meant to say is that there is a convenient standard library module `timeit` which makes such measurements so trivial that it's easier to just time it than to ask about it, especially when it's run from the command line. – Peter Hansen Feb 11 '10 at 01:39
  • @GregS what would you use instead of a secure hash? I need to have a fingerprint of a python dict (containing search parameters) so that I can determine (1) if this search has already been done before and (2, if it's a new search) if the results of the search are available for display. – Emilien Feb 11 '10 at 16:44
  • I defer to sttwister. Since I learned something from the exchange, I'm upvoting the question. – President James K. Polk Feb 13 '10 at 00:54

2 Answers


Why not just benchmark it?

>>> import hashlib
>>> import timeit
>>> def sha1(s):
...     return hashlib.sha1(s).hexdigest()
...
>>> def sha512(s):
...     return hashlib.sha512(s).hexdigest()
...
>>> t1 = timeit.Timer("sha1('asdf' * 100)", "from __main__ import sha1")
>>> t512 = timeit.Timer("sha512('asdf' * 100)", "from __main__ import sha512")
>>> t1.timeit()
3.2463729381561279
>>> t512.timeit()
6.5079669952392578

So on my machine, sha512 is about twice as slow as sha1. But as GregS said, why would you use a secure hash for caching? Try the built-in hash() function, which should be really fast and well tuned:

>>> s = "asdf"
>>> hash(s)
-618826466
>>> s = "xxx"
>>> hash(s)
943435
>>> hash("xxx")
943435

Or better yet, use the built-in Python dictionaries. Maybe you can tell us more about what you plan on caching.

EDIT: I'm thinking that you are trying to achieve something like this:

key = hashlib.sha1(object_to_cache_as_string).hexdigest()  # avoid naming this "hash", which would shadow the builtin
cache[key] = object_to_cache

What I was referring to by "use the builtin Python dictionaries" is that you can simplify the above:

cache[object_to_cache_as_string] = object_to_cache

In this way, Python takes care of the hashing so you don't have to!

Regarding your particular problem, you could refer to Python hashable dicts in order to make a dictionary hashable. Then, all you'd need to do to cache the object is:

cache[object_to_cache] = object_to_cache
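
A minimal sketch of such a hashable dict (my illustration, not from the linked question; `HashableDict` is a hypothetical name, and the linked question covers the trade-offs):

class HashableDict(dict):
    def __hash__(self):
        # a frozenset of the items is order-independent and hashable,
        # provided every key and value is itself hashable
        return hash(frozenset(self.items()))

search_params = HashableDict(query="python", page=2)
cache = {}
cache[search_params] = search_params  # the dict itself is now a valid key

This assumes the dict is never mutated after being used as a key; if it is, lookups will silently break.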

EDIT - Notes about Python 3

Python 3.3 introduces hash randomization, which means that computed hashes can differ across processes, so you should not rely on the computed hash unless you set the PYTHONHASHSEED environment variable to 0.
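
If you need a fingerprint that stays stable across processes (as in the caching scenario above), hashlib remains the right tool. A minimal sketch (my illustration, not part of the original answer):

import hashlib

def stable_fingerprint(s):
    # hashlib digests are deterministic across processes and machines,
    # unlike hash(); sha1 is chosen here for speed, not security
    return hashlib.sha1(s.encode("utf-8")).hexdigest()

Two separate interpreter sessions will produce the same digest for the same input, which is exactly what hash() no longer guarantees.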

References:

- https://docs.python.org/3/reference/datamodel.html#object.__hash__
- https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED

sttwister
  • Thanks for taking the time to benchmark it. As many of you said, I probably don't need to use secure hashing for caching. Basically I need to store a fingerprint of [the content of] a dictionary. As I can't use either `hashlib` or `hash()` directly on a dictionary, I was building a string containing the elements of that dictionary (don't like this approach) and then using `hashlib` on it... But now you've intrigued me with "use the builtin Python dictionaries", what do you mean by that? – Emilien Feb 11 '10 at 16:42
  • By reading your comments (all of you), I realized I didn't need to use any secure hashing, so I implemented my own "hashing" algorithm. Since the dictionary always has specific elements, and each value has an id, I create a string from those ids and cache that. Thanks all. – Emilien Feb 14 '10 at 10:46
  • To anyone looking at this answer's recommendation to use hash() instead of hashlib for non-crypto hashing: keep in mind that in Python 3 and some implementations of Python 2, `hash()` does not return consistent values across different instances and so may be problematic if you're running on a distributed platform and trying to generate cache keys or something like that. For example, every "dyno" (like a linux instance) on Heroku appears to return a different result for `hash()` on the same string. It's only consistent inside the instance itself. GAE reportedly has similar behavior. – B Robster Jan 08 '15 at 01:47

Perhaps a naive test... but it looks like it depends on how much you're hashing: sha512 works on 128-byte blocks versus sha256's 64-byte blocks, so the same input takes half as many block operations with sha512?

>>> import timeit
>>> import hashlib
>>> for sha in [ x for x in dir(hashlib) if x.startswith('sha') ]:
...   t = timeit.Timer("hashlib.%s(data).hexdigest()" % sha,"import hashlib; data=open('/dev/urandom','r').read(1024)")
...   print sha + "\t" + repr(t.timeit(1000))
...
sha1    0.0084478855133056641
sha224  0.034898042678833008
sha256  0.034902095794677734
sha384  0.01980900764465332
sha512  0.019846916198730469
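
For anyone rerunning this on Python 3 today, here is a sketch of the same loop (my adaptation, not MattH's original; absolute numbers will of course vary by machine):

import hashlib
import os
import timeit

data = os.urandom(1024)  # 1 KiB of random input, read once up front

for name in ['sha1', 'sha224', 'sha256', 'sha384', 'sha512']:
    # hashlib.new() looks the algorithm up by name
    t = timeit.timeit(lambda: hashlib.new(name, data).hexdigest(), number=1000)
    print("%s\t%r" % (name, t))

On typical 64-bit hardware the 64-bit-word variants (sha384/sha512) still tend to beat sha224/sha256.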
MattH
  • I got similar results. I think the takeaway here is that md5 and sha1 are similar in speed (I also benchmarked md5 using this method) and then sha512 is faster than all the hashes in between. Therefore, use sha1 for speed and sha512 for better hashing. The others don't make sense from a performance point of view. – Adam Nelson Dec 15 '10 at 22:55
  • I got very different results; maybe current implementations or machines are better optimized: `sha1`: `0.00902104377746582`, `sha224`: `0.007354021072387695`, `sha256`: `0.007508993148803711`, `sha384`: `0.004772186279296875`, `sha512`: `0.004884004592895508` – user Aug 02 '15 at 12:40