1



I need to generate a unique integer id for a string.

Reason:
I have a database application that can run on different databases. This databases contains parameters with parameter types that are generated from external xml data. the current situation is that i use the ordinal number of the Enum. But when a parameter is inserted or removed, the ordinals get mixed up:
(FOOD = 0 , TOYS = 1) <--> (FOOD = 0, NONFOOD = 1, TOYS = 2)

The ammount of Parameter types is between 200 and 2000, so i am scared a bit using hashCode() for a string.

P.S.: I am using Java.

Thanks a lot

ratchet freak
  • 47,288
  • 5
  • 68
  • 106
Gerald
  • 23
  • 1
  • 1
  • 3
  • If the strings are expected to be short, you should just use the strings themselves. – tofutim Jun 14 '11 at 15:06
  • the strings are up to 32 chars, but there is a table where the ids are put together (parameterGroup * 10000 + parameterType), so they need to have a numeric representation. this is part of a database index and should not be longer than 10 byte, so appending strings together won't work too. – Gerald Jun 14 '11 at 15:10
  • I don't understand why you say you're scared about using `String` implementation of `hashCode()`. Can you explain? – MarcoS Jun 14 '11 at 15:15
  • 1
    because of the possible collision. the hash code is not unique – Gerald Jun 14 '11 at 15:16
  • I don't think there would be a way to generically generate a number like this. If your Strings can be up to 32 characters, an int would not have a large enough range to be able to have a unique number for every possible String. You mentioned hashCode(), but this could return the same number for two different Strings, it could also return negative numbers, which you may not want. – worpet Jun 14 '11 at 15:20
  • @Gerald: if you want an `int` as a result, you can't expect much. One possibility is to use cryptographic algorithms as suggested in answers below, but they will not give you an `int` – MarcoS Jun 14 '11 at 15:21
  • `should not be longer than 10 bytes` are you speaking of the length of its decimal representation (10 digits)? – leonbloy Jun 14 '11 at 15:23
  • 10 bytes is the left space in the DB index. – Gerald Jun 14 '11 at 15:31
  • Why not just execute more caution when updating your enum in Java? – Nick Johnson Jun 15 '11 at 04:29

6 Answers6

5

I would use a mapping table in the database to map these Strings to an auto increment value. These mapping should then be cached in the application.

Kai
  • 38,985
  • 14
  • 88
  • 103
  • Thanks for your answer, i also thought about that solution, i think thats the way i would take when i don't find an ID generation. the thing i don't like is the storing of additional data. – Gerald Jun 14 '11 at 15:23
2

Use a cryptographic hash. MD5 would probably be sufficient and relatively fast. It will be unique enough for your set of input.

How can I generate an MD5 hash?

The only problem is that the hash is 128 bits, so a standard 64-bit integer won't hold it.

Community
  • 1
  • 1
  • No you're not wrong. There are also integers (typed as long or long long) that are larger. In some cases, the integer type is 16 bit, 32, or 64. The types support numbers in the Integer set as opposed to supporting numbers in the Real set. The length just determines how many numbers you can represent in the Integer set. –  Jun 14 '11 at 15:34
1

If you need to be absolute certain that the id are unique (no collissions) and your strings are up to 32 chars, and your number must be of no more than 10 digits (approx 32 bits), you obviously cannot do it by a one way function id=F(string).

The natural way is to keep some mapping of the string to unique numbers (typically a sequence), either in the DB or in the application.

leonbloy
  • 73,180
  • 20
  • 142
  • 190
0

I came across this post that's sensible: How to convert string to unique identifier in Java

In it the author describes his implementation:

public static long longHash(String string) {
  long h = 98764321261L; 
  int l = string.length();
  char[] chars = string.toCharArray();

  for (int i = 0; i < l; i++) {
    h = 31*h + chars[i];
  }
  return h;
}
Philippe
  • 4,088
  • 4
  • 44
  • 49
0

If you know the type of string values (length, letter patterns), you can count the total number of strings in this set and if it fits within 32 bits, the count function is your integer value.

Otherwise, the string itself is your integer value (integer in math terms, not Java).

Nick
  • 5,765
  • 5
  • 27
  • 36
0

By Enum you mean a Java Enum? Then you could give each enum value a unique int by your self instead of using its ordinal number:

public enum MyEnum {

    FOOD(0),
    TOYS(1),

    private final int id;

    private MyEnum(int id)
    {
        this.id = id;
    }
}
Kai
  • 38,985
  • 14
  • 88
  • 103
  • yes, that is true, but from where do i take this number? it must somehow be unique for the string. – Gerald Jun 14 '11 at 15:42
  • @Gerald: When you have to extend the enum take the last id and increment it for the new value... – Kai Jun 14 '11 at 15:44
  • nice idea, but that generator always generates the whole enum. its a stupid script with no memory of things that happen before. – Gerald Jun 14 '11 at 15:47