It is never late for a good subject and I am sure people would be interested on my findings.
I needed a hash function and after reading this post and doing a bit of research on the links given here, I came up with this variation of Daniel J Bernstein's algorithm, which I used to do an interesting test:
unsigned long djb_hashl(const char *clave)
{
unsigned long c,i,h;
for(i=h=0;clave[i];i++)
{
c = toupper(clave[i]);
h = ((h << 5) + h) ^ c;
}
return h;
}
This variation hashes strings ignoring the case, which suits my need of hashing users login credentials. 'clave' is 'key' in Spanish. I am sorry for the spanish but its my mother tongue and the program is written on it.
Well, I wrote a program that will generate usernames from 'test_aaaa' to 'test_zzzz', and -to make the strings longer- I added to them a random domain in this list: 'cloud-nueve.com', 'yahoo.com', 'gmail.com' and 'hotmail.com'. Therefore each of them would look like:
test_aaaa@cloud-nueve.com, test_aaab@yahoo.com,
test_aaac@gmail.com, test_aaad@hotmail.com and so on.
Here is the output of the test -'Colision entre XXX y XXX' means 'Collision of XXX and XXX'. 'palabras' means 'words' and 'Total' is the same in both languages-.
Buscando Colisiones...
Colision entre 'test_phiz@hotmail.com' y 'test_juxg@cloud-nueve.com' (1DB903B7)
Colision entre 'test_rfhh@hotmail.com' y 'test_fpgo@yahoo.com' (2F5BC088)
Colision entre 'test_wxuj@hotmail.com' y 'test_pugy@cloud-nueve.com' (51FD09CC)
Colision entre 'test_sctb@gmail.com' y 'test_iohw@cloud-nueve.com' (52F5480E)
Colision entre 'test_wpgu@cloud-nueve.com' y 'test_seik@yahoo.com' (74FF72E2)
Colision entre 'test_rfll@hotmail.com' y 'test_btgo@yahoo.com' (7FD70008)
Colision entre 'test_wcho@cloud-nueve.com' y 'test_scfz@gmail.com' (9BD351C4)
Colision entre 'test_swky@cloud-nueve.com' y 'test_fqpn@gmail.com' (A86953E1)
Colision entre 'test_rftd@hotmail.com' y 'test_jlgo@yahoo.com' (BA6B0718)
Colision entre 'test_rfpp@hotmail.com' y 'test_nxgo@yahoo.com' (D0523F88)
Colision entre 'test_zlgo@yahoo.com' y 'test_rfdd@hotmail.com' (DEE08108)
Total de Colisiones: 11
Total de Palabras : 456976
That is not bad, 11 collisions out of 456,976 (off course using the full 32 bit as table lenght).
Running the program using 5 chars, that is from 'test_aaaaa' to 'test_zzzzz', actually runs out of memory building the table. Below is the output. 'No hay memoria para insertar XXXX (insertadas XXX)' means 'There is not memory left to insert XXX (XXX inserted)'. Basically malloc() failed at that point.
No hay memoria para insertar 'test_epjcv' (insertadas 2097701).
Buscando Colisiones...
...451 'colision' strings...
Total de Colisiones: 451
Total de Palabras : 2097701
Which means just 451 collisions on 2,097,701 strings. Note that in none of the occasions, there were more than 2 collisions per code. Which I confirm it is a great hash for me, as what I need is to convert the login ID to a 40 bit unique id for indexing. So I use this to convert the login credentials to a 32 bit hash and use the extra 8 bits to handle up to 255 collisions per code, which lookign at the test results would be almost impossible to generate.
Hope this is useful to someone.
EDIT:
Like the test box is AIX, I run it using LDR_CNTRL=MAXDATA=0x20000000 to give it more memory and it run longer, the results are here:
Buscando Colisiones...
Total de Colisiones: 2908
Total de Palabras : 5366384
That is 2908 after 5,366,384 tries!!
VERY IMPORTANT: Compiling the program with -maix64 (so unsigned long is 64 bits), the number of collisions is 0 for all cases!!!