8

I want to generate int from a string and be able to generate it back. Something like hash function but two-way function. I want to use ints as ID in my application, but want to be able to convert it back in case of logging or debugging.

Like:

int id = IDProvider::getHash("NameOfMyObject");

object * a = createObject(id);

...

if(error)
{
    LOG(IDProvider::getOriginalString(a->getId()), "some message");
}

I have heard of slightly modified CRC32 to be fast and 100% reversible, but I can not find it and I am not able to write it by myself.

Any hints what should I use? Thank you!

edit I have just founded the source I have the whole CRC32 thing from:

Jason Gregory : Game Engine Architecture

quotation:

"As with any hashing system, collisions are a possibility (i.e., two different strings might end up with the same hash code). However, with a suitable hash function, we can all but guarantee that collisions will not occur for all reasonable input strings we might use in our game. After all, a 32-bit hash chode represents more than four billion possible values. So if our hash function does a good job of distributing strings evently throughout this very large range, we are unlikely to collide. At Naughty Dog, we used a variant of the CRC-32 algorithm to hash our strings, and we didn't encounter a single collision in over two years of development on Uncharted: Drake's Fortune."

relaxxx
  • 7,566
  • 8
  • 37
  • 64
  • 10
    That's not what hashing means. – SLaks Oct 30 '11 at 19:00
  • Are you talking about this? Only a special case of CRC32 is reversible: http://stackoverflow.com/questions/1514040/reversing-crc32 – wkl Oct 30 '11 at 19:02
  • 2
    no crc32 doesn't do you what you think it does, its main purpose is to detect bit errors. – AndersK Oct 30 '11 at 19:03
  • 2
    @SLaks I know what hashing means, but I couldn't find out other word. That is why I use quotation marks, and wrote `something like` and use the term `two-way` – relaxxx Oct 30 '11 at 19:12
  • 1
    This won't give an int, and it is not bound to C++, but for search sake, one might be interested in base64 encoding (e.g. in bash, `echo whatever|base64` gives `d2hhdGV2ZXIK`, which `base64 -d` decodes back into `whatever`). – Skippy le Grand Gourou Mar 23 '17 at 13:18

6 Answers6

22

Reducing an arbitrary length string to a fixed size int is mathematically impossible to reverse. See Pidgeonhole principle. There is a near infinite amount of strings, but only 2^32 32 bit integers.

32 bit hashes(assuming your int is 32 bit) can have collisions very easily. So it's not a good unique ID either.

There are hashfunctions which allow you to create a message with a predefined hash, but it most likely won't be the original message. This is called a pre-image.

For your problem it looks like the best idea is creating a dictionary that maps integer-ids to strings and back.


To get the likelyhood of a collision when you hash n strings check out the birthday paradox. The most important property in that context is that collisions become likely once the number of hashed messages approaches the squareroot of the number of available hash values. So with a 32 bit integer collisions become likely if you hash around 65000 strings. But if you're unlucky it can happen much earlier.

CodesInChaos
  • 106,488
  • 23
  • 218
  • 262
  • Thank you for your answer. I do not want to make it with dictionary by various reasons. I am curious about the modification of some hash function, please, see the edit of my question – relaxxx Oct 30 '11 at 19:28
  • @relax the collision probability depends on the number of strings you hash and the size of your hash. If you want no practical collisions I recommend 128 bit+ hashes. But you still won't get back the original string easily. – CodesInChaos Oct 30 '11 at 20:49
17

I have exactly what you need. It is called a "pointer". In this system, the "pointer" is always unique, and can always be used to recover the string. It can "point" to any string of any length. As a bonus, it also has the same size as your int. You can obtain a "pointer" to a string by using the & operand, as shown in my example code:

#include <string>
int main() {
    std::string s = "Hai!";
    std::string* ptr = &s; // this is a pointer
    std::string copy = *ptr; // this retrieves the original string
    std::cout << copy; // prints "Hai!"
}
Puppy
  • 144,682
  • 38
  • 256
  • 465
  • 1
    +1 for the formulation. :) Though, to be fair, pointer values will not be the same between runs. – Xeo Oct 30 '11 at 19:10
  • 4
    This has a number of problems: 1) The ID can only be used within a single process 2) You need to keep the string alive long enough 3) If you apply this to several strings that are equal content wise, but not reference wise you get different IDs. – CodesInChaos Oct 30 '11 at 19:11
  • @CodeInChaos thank you, I was just about to write it by myself – relaxxx Oct 30 '11 at 19:17
  • @CodeInChaos: You can use `boost` to obtain interprocess pointers. You can also use other kinds of smart pointer to serialize pointers. Secondly, he was only going to use a few constant strings anyway, so lifetime and reference issues aren't that big of a deal. – Puppy Oct 30 '11 at 19:23
  • 1
    "As a bonus, it also has the same size as your int." I32LP64 disagrees – Pod May 06 '16 at 11:45
4

What you need is encryption. Hashing is by definition one way. You might try simple XOR Encryption with some addition/subtraction of values.

... and many more via google search...

Community
  • 1
  • 1
HashIsGood
  • 41
  • 1
  • I know hash function is one way. That is why I wrote about the `modification of hash function`. Please, see the edit of my question – relaxxx Oct 30 '11 at 19:31
3

You could look at perfect hashing

http://en.wikipedia.org/wiki/Perfect_hash_function

It only works when all the potential strings are known up front. In practice what you enable by this, is to create a limited-range 'hash' mapping that you can reverse-lookup.

In general, the [hash code + hash algorithm] are never enough to get the original value back. However, with a perfect hash, collisions are by definition ruled out, so if the source domain (list of values) is known, you can get the source value back.

gperf is a well-known, age old program to generate perfect hashes in c/c++ code. Many more do exist (see the Wikipedia page)

sehe
  • 374,641
  • 47
  • 450
  • 633
  • In practice I'd almost always take a dictionary over this. But occasionally it might offer performance benefits. – CodesInChaos Oct 30 '11 at 19:12
  • Agree. One benefit,though, is that it is possible to share _just_ the hash function. Now any client can arrive at the correct hash sum without having to know the full table of possible input values. That is a big boon for distributed software with a relatively large source domain – sehe Oct 30 '11 at 19:14
1

Is it not possible. Hashing is not-returnable function - by definition.

Max
  • 842
  • 5
  • 16
0

As everyone mentioned, it is not possible to have a "reversible hash". However, there are alternatives (like encryption).

Another one is to zip/unzip your string using any lossless algorithm.

That's a simple, fully reversible method, with no possible collision.

tm.crz
  • 1