Hash algorithm for english phrases

Question

I'm working on an English words app right now, I want every word has a different int id, as all words are different from one another, I think they simply can be assigned an integer (or long?) easily.

I don't want to give them ids serially according to alphabetical order. I think there may be an existing algorithm for this very requirement, I don't want to invent my own wheels, so, please help me.

I prefer integer id because I want the structure to be compact and small enough for transfer over the internet, because one word list might contain hundreds and thousands of words.

Imagine I have data structure as bellow:

struct word {
  int wordId;
  byte familiarity;
}
// I prefer the mapping like this
apple -> 0x1,  0x4
app   -> 0x2E, 0x2
ape   -> 0xEA, 0x1

UPDATE:

Okay, what I'm trying to do is to provide the users several wordlists, each of which contains a couple of words, chances are the user already learn some of the words(eg. apple), so he/she wants to skip those words, and hope they'll never show up again. So, I want to enable the user to skip those words, and the selected words will be sent to the server or kept in a local file, it might be unnecessary to send the whole word or phrase. I've found a question here: http://stackoverflow.com/questions/7700400/whats-a-good-hash-function-for-english-words, do you have any better solutions?

It's called a dictionary; several good ones, both Oxfords and Websters, are available at your local bookstore. However, it may take you a while to type in all 300,000 words from a good one. — Pieter Geerkens, Mar 18 '13 at 08:29
"I don't want to give them ids serially according to alphabetical order" ― why? — n. m. could be an AI, Mar 18 '13 at 08:30
I see this as a proposed solution for an unknow problem. Please provide some background on the actual business problem being addressed, as something like the LZW encryption algorithm may be much better suited to it than this trial balloon. — Pieter Geerkens, Mar 18 '13 at 08:35
@n.m. because the vocabulary is growing, for instance, in v1.0, my vocabulary is "apple, orange, strawberry", if I give them id serially, it may have a mapping like "apple 1, orange 2, strawberry 3", and in v2.0, I add a lemon, if I assign ids alphabetically, it should be "apple 1, lemon 2, orange 3, strawberry 4", but the ids are already taken. — Liang Zhao, Mar 18 '13 at 08:57
@n.m. It will be the end of the world if I lost the mapping file, I think it's the most natural way to get an identical id for every word in the vocabulary, just use the lexical signature, but how can I compress them into an integer and no collision(sounds like a mission impossible). — Liang Zhao, Mar 18 '13 at 09:19
You can use 5 bit per character (26 for case-insensitive alphabet there are 6 reserved slot left for may be non-alphabetic characters such as `'`, `.`, `-`), or 6 bit per character (52 for case sensitive alphabet, and 12 reserved slot). Then you can encode 12 characters in a word (5 bit) or 10 characters in a word (6 bit), with 4 bit left for collision. As for how to encode the words, prefix is the easiest, but may be problematic for long words starting with the same prefix. If you follow this approach, you should test it out yourself. — nhahtdh, Mar 18 '13 at 09:28
It would be very hard to invent a two-way mapping algorithm between English words and 32-bit or even 64-bit integers, without storing a fairly large amount of additional information (like, the actual mapping file). Even if you manage to create such an algorithm, what will happen if you lose it? — n. m. could be an AI, Mar 18 '13 at 09:49
@nhahtdh thanks, your solution is worth a try, very good one, and yes, collision can be easily found when prefix are long enough. and to n.m. I don't mind whether the processing time is long or not, for I will only calculate the id (the id is generated very time) when the server is starting up, and bind to the word. however, what must be done is the id for a specific word must be the same every time, and identical. so ids can be stored in memory at runtime. — Liang Zhao, Mar 18 '13 at 10:07
You are asking for something impossible. What kind of answers do you expect? — n. m. could be an AI, Mar 18 '13 at 16:22

score 0 · Accepted Answer · edited May 23 '17 at 11:50

Yes, it seems impossible to find a perfect collision free hashing algorithm, I might end up with maintaining a mapping file.

I also find a great question and answer here.

Actually I don't mind the performance of this algorithm for it's all done on server, and only once while starting up. All I want is the id for every word/phrase is unique, as short as possible, like a fingerprint. I wonder if I can take advantage of the prime numbers..

Finally, I decide to use a long for my id

(8 bit) first letter of the first word
(8 bit) last letter of the last word
(4 bit) word count
(4 bit) the serial number of the longest word in the phrase
(8 bit) character count, space included
(32bit) MurmurHash3 result

You can find murmurHash3 cs implementation here:
https://gist.github.com/automatonic/3725443
I think this approach will generate unique ids for any existing words and phrases with no collision.

Hash algorithm for english phrases

1 Answers1