I have a large number (several million) of strings of a predetermined size (several hundred characters each). One of the things I want to do with them is build a frequency table. Not surprisingly, the process takes a long time in my test runs.
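To make this concrete, here is a minimal sketch of the counting loop I have in mind; the file name and the way the strings are read are placeholders for my actual data source:

```python
from collections import defaultdict

# Frequency table: string -> number of occurrences.
counts = defaultdict(int)

with open("strings.txt") as f:   # placeholder for the real data source
    for line in f:
        s = line.rstrip("\n")    # each string is a fixed few hundred characters
        counts[s] += 1
```

I know `collections.Counter` would express the same thing more compactly; I'm showing the explicit dictionary so my questions about its behaviour are clear.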
I was concerned about the dictionary size, but it appears there is no theoretical upper limit in Python other than the available physical memory. So I could technically create a large dictionary up front and there should not be any rehashing necessary. Is this a correct assumption?
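What I mean by creating it up front is something along these lines (purely a sketch; it assumes I could enumerate the distinct strings in a first pass, which may not be practical):

```python
strings = ["aaa", "bbb", "aaa"]          # placeholder for the real data source

# First pass: collect the distinct strings, then build the dictionary
# at its final size in one go, so the counting pass never inserts a
# new key and therefore never triggers a resize.
counts = dict.fromkeys(set(strings), 0)

# Second pass: only updates existing keys.
for s in strings:
    counts[s] += 1
```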
Additionally, would switching from the built-in string hash to another hash function (perhaps one with output longer than 32 bits) make a significant difference in terms of hash calculation time and collisions?
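By "another hash function" I mean something in this spirit: keying the dictionary on a longer digest instead of letting it hash the strings directly (MD5 picked arbitrarily here):

```python
import hashlib

strings = ["aaa", "bbb", "aaa"]          # placeholder for the real data source

counts = {}
for s in strings:
    # Key on a 128-bit MD5 digest instead of the raw string; the dict
    # still hashes the digest bytes internally, this only changes what
    # gets fed into that machinery.
    key = hashlib.md5(s.encode("utf-8")).digest()
    counts[key] = counts.get(key, 0) + 1
```

This of course replaces the readable string keys with digests, so I would have to keep the originals around separately; it is only meant to show what I mean by swapping the hash.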
Lastly, I read this interesting question and now I wonder whether running things with PyPy or one of the other Python optimizations suggested there would make a significant difference in this case.
I am rather new to Python and haven't got all the bits in place yet, so I'd appreciate it if you keep that in mind in your answers.