
I have a large number (several million) of strings of a predetermined size (several hundred characters each). One of the things I am interested in is calculating a frequency table for these strings. Not surprisingly, the process takes a long time on my test runs.
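To make the question concrete, the counting step I have in mind is roughly the sketch below; the file name and the one-string-per-line layout are just placeholders for my real input:

```python
from collections import Counter

# Sketch of the counting step; "strings.txt" and the one-string-per-line
# layout are placeholders for the real input.
def build_frequency_table(path):
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            counts[line.rstrip("\n")] += 1
    return counts

# freq = build_frequency_table("strings.txt")
# print(freq.most_common(10))
```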

I was concerned about the dictionary size, but it appears that there is no theoretical upper limit in Python besides the physically available memory. So I could technically create one large dictionary up front and there should not be any rehashing necessary. Is this a correct assumption?
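I could not find a public way to pre-size a dict in CPython, so for now I have only been observing the automatic resizes with a rough sketch like the one below (sys.getsizeof measures just the container and its hash table, not the keys and values):

```python
import sys

# Observe CPython's automatic dict resizing: sys.getsizeof reports the size
# of the dict object itself (hash table included, keys/values excluded),
# which grows in discrete jumps as entries are inserted.
d = {}
last = sys.getsizeof(d)
for i in range(100000):
    d[str(i)] = i
    size = sys.getsizeof(d)
    if size != last:
        print("resized after %d entries: %d -> %d bytes" % (i + 1, last, size))
        last = size
```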

Additionally, would switching from the built-in string hash to another hash function (perhaps one with output longer than 32 bits) make a significant difference in terms of hash calculation time and collisions?
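One variant I have been considering (a sketch only, assuming Python 3.6+ for hashlib.blake2b; the 16-byte digest size is arbitrary) is keying the table on a fixed-length digest of each string rather than the full string itself. The obvious downside is that two distinct strings could silently share a digest, and the original strings can no longer be read back out of the table.

```python
import hashlib
from collections import Counter

# Key the table on a fixed-length digest instead of the full string.
# blake2b requires Python 3.6+; the digest size here is arbitrary.
def digest_key(s, digest_size=16):
    return hashlib.blake2b(s.encode("utf-8"), digest_size=digest_size).digest()

counts = Counter()
for s in ("aaa", "bbb", "aaa"):  # stand-in for the real strings
    counts[digest_key(s)] += 1
print(counts)
```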

Lastly, I read this interesting question and now I wonder whether running things with PyPy, or one of the other Python optimizations suggested there, would make a significant difference in this case.
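If it helps, I was planning to compare interpreters with something minimal like the harness below, run once under CPython and once under PyPy; the synthetic data is just a stand-in for my real strings:

```python
import timeit

# Minimal harness: run this same script under CPython and under PyPy
# and compare the reported times.
setup = """
import random, string
from collections import Counter
data = ["".join(random.choice(string.ascii_lowercase) for _ in range(300))
        for _ in range(100000)]
"""
print(timeit.timeit("Counter(data)", setup=setup, number=5))
```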

I am rather new to Python and haven't got all the bits in place yet, so I'd appreciate it if you keep that in mind in your answers.

  • Why not dump the whole thing into a relational database and let an index do the job? – Daniele Bernardini Mar 09 '16 at 10:50
  • @DanieleBernardini because this analysis will be part of a pipeline that'll run on a cloud service. I don't think it would be feasible to create a new db every time somebody runs the analysis – posdef Mar 09 '16 at 11:31
  • This seems like a good use for some non-relational database options, something like Hadoop and Map-Reduce. – goodguy5 Mar 10 '16 at 19:45
