Represent string into unique int code

Question

My question is how I can represent a string as an int code. I don't want it parsed or converted to int; it's more like translating it, the way you would from English to French or German.

What I want is to convert the string into an int code that can be used as a search reference. I was going to use the string's hash code, but since hashing depends on the machine's environment settings it is not optimal for my project. I had also considered using the ASCII codes of each letter in the word, but there are a great many long words in several languages and the app is globalized, so that's not a very viable solution. The project is going to be deployed as an Azure cloud site, so I don't have full-text search.

Any ideas on what I can do in this case?

Why would hashing a string depend on your environment settings? — nvoigt, Jan 22 '15 at 15:02
@nvoigt: I believe the 32-bit and 64-bit CLRs do different things. Fundamentally GetHashCode isn't meant for *persistent* values. — Jon Skeet, Jan 22 '15 at 15:03
Hashing is only unique for sufficiently long outputs (128 bits or so, 256 if you're paranoid). With 32-bit integers as hash output, collisions become likely as you approach 65k strings. For short outputs you need a lookup table or some way of handling collisions. — CodesInChaos, Jan 22 '15 at 15:03
There are more than 2^32 possible strings, so there's no way of getting a truly unique value, unless you basically store it in a database of some form, in which case you can just use a counter. — Jon Skeet, Jan 22 '15 at 15:04
As @JonSkeet said, hashing is not a good idea for persistence. One more reason is that .NET may change the implementation of GetHashCode(), and it keeps doing so. — Rajnikant, Jan 22 '15 at 15:17
How long are these strings? Are they phrases or words? Are you translating app strings or building a multipurpose dictionary? — cyberhubert, Jan 22 '15 at 15:22
@cyberhubert the strings could be as long as 30 letters per word, there are some special symbols used in the language, and it's being built as a multipurpose dictionary — criogenist, Jan 22 '15 at 15:29
@cyberhubert because I want to use it as a search reference. For example, I search for a word that exists in the database, and I need a formula to convert that word into a number so I can compare numbers in the database instead of comparing strings — criogenist, Jan 22 '15 at 15:41
@criogenist: Well just do the lookup in the database then - that's what it's for. You can use the database ID where you already know it, and the string form where you don't. — Jon Skeet, Jan 22 '15 at 15:44
@JonSkeet the problem with that approach is that a string comparison takes more time than an int comparison. I've already considered it, but the response time is slow, and when the site grows larger I could get a timeout from the server — criogenist, Jan 22 '15 at 15:47
@criogenist: If you have an index on the database, finding the right ID should be *extremely* fast. You just need to make sure you *do* have the right index. — Jon Skeet, Jan 22 '15 at 15:48
That's the problem: Azure SQL Server does not support full-text indexes or full-text search @JonSkeet — criogenist, Jan 22 '15 at 15:52
You would *not* use full-text to look up a single word in a word-list-like table - isn't that all you need to do? — Alex K., Jan 22 '15 at 15:53
Indeed - your query is just something like `select id from words where word=@word`. — Jon Skeet, Jan 22 '15 at 15:54
You can even join with the target language table to get the response in one query: `select words_de.word from words_de inner join words on words_de.id=words.id where words.word=@word` — cyberhubert, Jan 22 '15 at 15:57
That takes me back to the original problem, the timeout. When I look up a word in the table there would be several matches, and it would have to search the whole table to find them. At first it would work, but since I'm getting information from **several tables** it wouldn't be very optimal: I have to bind the different words to the different ids of **each post** in those tables, and each post will have approximately 500 words — criogenist, Jan 22 '15 at 16:03
You can also consider just one table with languages mapped to table columns. This should be really fast — cyberhubert, Jan 22 '15 at 16:06
If these are to be stored in a database, you should _let the database do the work_. Mark the values as keys, and any proper DB will handle all the necessary optimization/indexing behind the scenes for you. Looking things up by `int` isn't going to be noticeably faster than by `string` (never mind that other than by tokenizing a `string` value, you can't make a 32-bit or 64-bit index value that is guaranteed unique for a given `string`) — Peter Duniho, Jan 22 '15 at 19:52

score 1 · Answer 1 · edited May 23 '17 at 10:26

You are describing a solution rather than your requirements, so there may well be better options.

Anyway, you could use a platform-independent hash from the managed cryptography classes, such as SHA512Managed. Be aware, however, that a hash is not guaranteed to be unique, so you might end up with collisions; but at least it's built in and you don't have to reinvent the wheel. Go here for an example.
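As a rough illustration of the idea above, here is a minimal sketch that derives a stable 64-bit code from a SHA-512 digest. The class and method names (`StableHash`, `ToInt64Code`) are made up for this example, and the choices of UTF-8 encoding and of keeping only the first 8 bytes are assumptions, not part of the original suggestion. Truncating the digest brings back a real collision risk: by the square-root rule mentioned in the comments, a 64-bit code starts to collide somewhere around 2^32 strings.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, for illustration only.
public static class StableHash
{
    // Returns the same value for the same string on any machine or CLR,
    // unlike string.GetHashCode(). NOT guaranteed unique.
    public static long ToInt64Code(string word)
    {
        // Assumption: UTF-8 is used as the canonical byte encoding.
        byte[] bytes = Encoding.UTF8.GetBytes(word);

        using (SHA512 sha = SHA512.Create())
        {
            byte[] digest = sha.ComputeHash(bytes);

            // Assemble the first 8 digest bytes big-endian by hand so the
            // result does not depend on the machine's byte order.
            long code = 0;
            for (int i = 0; i < 8; i++)
            {
                code = (code << 8) | digest[i];
            }
            return code;
        }
    }
}
```

A call such as `StableHash.ToInt64Code("Donaudampfschiff")` could then be stored in an indexed `bigint` column, with the original string kept alongside it to resolve the rare collision.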

answered Jan 22 '15 at 15:16

L-Four


While SHA512 has collisions in theory, the probability of them happening is astronomically small and can be safely neglected. Even considering deliberate collisions, nobody has ever published one. – CodesInChaos Jan 22 '15 at 15:58
SHA512 might be a reliable way to generate a unique(-enough) value for a string value. But it's not going to be very practical as an index, which is what the OP seems to be asking for. – Peter Duniho Jan 22 '15 at 19:49

score 1 · Accepted Answer · answered Jul 01 '15 at 21:24

I solved this by creating another table that contains just the words from each post, and querying it by the ID of each word.
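The idea of a separate word table can be sketched in memory as follows. This is only a stand-in for the database table, assuming each distinct word gets the next counter value as its ID (the approach suggested in the comments); `WordTable`, `GetOrAddId` and `GetWord` are hypothetical names, not the actual schema used.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for a "words" table with an int key.
public sealed class WordTable
{
    private readonly Dictionary<string, int> _ids =
        new Dictionary<string, int>(StringComparer.Ordinal);
    private readonly List<string> _words = new List<string>();

    // Returns the existing ID for a word, or assigns the next counter value.
    public int GetOrAddId(string word)
    {
        if (!_ids.TryGetValue(word, out int id))
        {
            id = _words.Count + 1; // IDs start at 1, like an identity column
            _ids.Add(word, id);
            _words.Add(word);
        }
        return id;
    }

    // Reverse lookup: ID back to the original string.
    public string GetWord(int id) => _words[id - 1];
}
```

Because the ID comes from a counter rather than from the string's contents, it is unique by construction; posts can then store lists of word IDs and be compared on ints.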

criogenist


score 0 · Answer 3 · answered Jan 22 '15 at 15:18

One such hash is the sum of the char codes of each letter: `int hash = s.Select<char, int>(x => (int)x).Aggregate((x, y) => x + y);` However, that hash has collisions. Another way is to concatenate the char codes of each letter, but you quickly exceed the range of an integer. One workaround is to subtract 64 from the value of each char: `uint hash = Convert.ToUInt32(s.Select<char, string>(x => (((int)x) - 64).ToString()).Aggregate((x, y) => x + y));`
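To make the limits of both approaches concrete, here is a small sketch. It restates the two expressions above as helper methods (the names `SumHash` and `TryConcatHash` are invented for this example) and uses `uint.TryParse` instead of `Convert.ToUInt32`, so that an out-of-range value is reported rather than thrown.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helpers, for illustration only.
public static class NaiveHashes
{
    // Sum of UTF-16 code units. Order is ignored, so anagrams collide.
    public static int SumHash(string s) => s.Sum(c => (int)c);

    // Concatenates (code - 64) for each char as decimal digits.
    // Returns false when the digits do not fit in a uint (or contain a
    // minus sign, which happens for chars below code 64).
    public static bool TryConcatHash(string s, out uint hash)
    {
        string digits = string.Concat(
            s.Select(c => (c - 64).ToString(CultureInfo.InvariantCulture)));
        return uint.TryParse(
            digits, NumberStyles.None, CultureInfo.InvariantCulture, out hash);
    }
}
```

With these helpers, `SumHash("ab")` and `SumHash("ba")` are equal, `TryConcatHash` yields the same value for `"AA"` and `"K"` (both produce the digits `11`), and it fails outright for a 20-letter word such as `"internationalization"`, whose 40 digits are far beyond the range of `uint`.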
