
I'm trying to convert a user access log into a pure binary format. This requires converting each string into an int using some hash method, and storing the "id -> string value" mapping somewhere so the original strings can be retrieved later.

Since I'm using Python, to save some processing time, instead of using hashlib to calculate the hash, can I simply use

string_hash = id(intern(some_string)) 

as the hash method? Are there any basic differences to be aware of compared to MD5 / SHA1? Is the probability of collision noticeably higher than with MD5 / SHA1?

Jason Xu

1 Answer


Doesn't work. id is not guaranteed to be consistent across interpreter executions; in CPython, it's the memory location of the object. Even if it were consistent, it doesn't have enough bytes for collision resistance. Why not just keep using the strings? ASCII or Unicode, strings can be serialized easily.
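
To illustrate the point about consistency, here is a minimal sketch, assuming Python 3. The names `stable_id` and `StringTable` are hypothetical and do not come from the question. It derives a 64-bit ID from a SHA-1 digest, which depends only on the string's bytes and so is the same on every run, and it keeps the reverse "id -> string value" table. Truncating the digest to 8 bytes is an assumption made for compact storage; it makes collisions more likely than with the full digest, so the table checks for them.

```python
import hashlib


def stable_id(text: str) -> int:
    """Return a 64-bit ID derived from the SHA-1 digest of the UTF-8 bytes."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    # Keep only the first 8 bytes (an assumption made for compact storage).
    return int.from_bytes(digest[:8], "big")


class StringTable:
    """Reverse mapping "id -> string value" for later retrieval."""

    def __init__(self) -> None:
        self._by_id = {}

    def add(self, text: str) -> int:
        key = stable_id(text)
        # setdefault stores text on first sight, else returns the stored string.
        existing = self._by_id.setdefault(key, text)
        if existing != text:
            raise ValueError(f"ID collision between {existing!r} and {text!r}")
        return key

    def lookup(self, key: int) -> str:
        return self._by_id[key]


table = StringTable()
key = table.add("GET /index.html")
print(key == stable_id("GET /index.html"))  # same input, same ID
print(table.lookup(key))
```

Unlike `id()`, the value returned here does not depend on where the object happens to live in memory, so it can be written to disk and read back by a later process.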

user2357112
  • Hi @user2357112, thanks for the key info. The only operation my data flow needs on these strings is equality comparison, so storing hashes instead of strings should save noticeable space. I tried id() because it might avoid repeatedly hashing the same string during processing. One question: when "intern(str)" runs, Python must internally compute some hash to locate the "original" string with the same value, right? If so, is there a way to get that hash? – Jason Xu Aug 10 '13 at 04:11
  • Sorry, that seems wrong! As tested on my Python, id('abc') is always 4297746472. It doesn't seem to be tied to the memory location, and it is stable across interpreter restarts. My Python info is "Python 2.7.5 (default, Aug 2 2013, 13:03:50) [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)] on darwin". – Jason Xu Aug 10 '13 at 04:20
  • @JasonHsu -- No it's not. Python interns small strings - that is why "abc" has the same id. Some further reading: http://stackoverflow.com/questions/1136826/what-does-python-sys-intern-do-and-when-should-it-be-used – root Aug 10 '13 at 04:59
  • Thanks for the info. In that case I can probably try using id(), since I can limit the strings to be short and ASCII-only, and later use another MD5-based string table for arbitrary UGC if the app ever really needs it. However, this is quite app-specific and not a universal approach. BTW, I cannot @ your user ID via auto-completion. : ) – Jason Xu Aug 10 '13 at 05:22
  • Though hashing speed is not the main performance bottleneck of my application, I'm still happy to learn more about high-performance hashing solutions in Python. This one seems quite interesting: http://code.google.com/p/xxhash/, which claims to be extremely fast, twice as fast as the second-fastest. – Jason Xu Aug 10 '13 at 06:02
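
On the question raised in the comments about what interning computes internally: in CPython the interned strings are kept in a dictionary, so the lookup relies on the same value that the built-in `hash()` returns. The sketch below assumes Python 3, where `intern` lives in `sys`; the helper `build` is a hypothetical name used only to create strings at run time. Note that since Python 3.3 string hashes are randomized per process by default unless `PYTHONHASHSEED` is fixed, so `hash()` is no better suited to on-disk IDs than `id()`.

```python
import sys


def build(parts):
    """Build a string at run time so it is not a compile-time constant."""
    return "".join(parts)


a = build(["user", "_", "agent"])
b = build(["user", "_", "agent"])

# Equal strings always have equal hashes within one process.
same_hash = hash(a) == hash(b)

# sys.intern returns the canonical object for each distinct value,
# so equal strings map to the same object after interning.
same_object = sys.intern(a) is sys.intern(b)

print(same_hash, same_object)
```

Interning therefore helps with fast equality checks inside a single process, while anything persisted across runs needs a digest computed from the string's content.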