
I have a large number (several million) of strings of a predetermined size (several hundred characters each). One of the things I am interested in is calculating a frequency table for these strings. Not surprisingly, the process takes a long time on my test runs.
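To make the question concrete, the counting step I have in mind is roughly the sketch below; the file name and the one-string-per-line layout are just placeholders for my real input:

```python
from collections import Counter

# Sketch of the counting step; "strings.txt" and the one-string-per-line
# layout are placeholders for the real input.
def build_frequency_table(path):
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            counts[line.rstrip("\n")] += 1
    return counts

# freq = build_frequency_table("strings.txt")
# print(freq.most_common(10))
```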

I was concerned about the dictionary size, but it appears that there is no theoretical upper limit in Python besides the physically available memory. So I could technically create one large dictionary up front and there should not be any rehashing necessary. Is this a correct assumption?
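I could not find a public way to pre-size a dict in CPython, so for now I have only been observing the automatic resizes with a rough sketch like the one below (sys.getsizeof measures just the container and its hash table, not the keys and values):

```python
import sys

# Observe CPython's automatic dict resizing: sys.getsizeof reports the size
# of the dict object itself (hash table included, keys/values excluded),
# which grows in discrete jumps as entries are inserted.
d = {}
last = sys.getsizeof(d)
for i in range(100000):
    d[str(i)] = i
    size = sys.getsizeof(d)
    if size != last:
        print("resized after %d entries: %d -> %d bytes" % (i + 1, last, size))
        last = size
```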

Additionally, would switching from the built-in string hash to another hash function (perhaps one with output longer than 32 bits) make a significant difference in terms of hash calculation time and collisions?
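One variant I have been considering (a sketch only, assuming Python 3.6+ for hashlib.blake2b; the 16-byte digest size is arbitrary) is keying the table on a fixed-length digest of each string rather than the full string itself. The obvious downside is that two distinct strings could silently share a digest, and the original strings can no longer be read back out of the table.

```python
import hashlib
from collections import Counter

# Key the table on a fixed-length digest instead of the full string.
# blake2b requires Python 3.6+; the digest size here is arbitrary.
def digest_key(s, digest_size=16):
    return hashlib.blake2b(s.encode("utf-8"), digest_size=digest_size).digest()

counts = Counter()
for s in ("aaa", "bbb", "aaa"):  # stand-in for the real strings
    counts[digest_key(s)] += 1
print(counts)
```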

Lastly, I read this interesting question and now I wonder whether running things with PyPy, or one of the other Python optimizations suggested there, would make a significant difference in this case.
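If it helps, I was planning to compare interpreters with something minimal like the harness below, run once under CPython and once under PyPy; the synthetic data is just a stand-in for my real strings:

```python
import timeit

# Minimal harness: run this same script under CPython and under PyPy
# and compare the reported times.
setup = """
import random, string
from collections import Counter
data = ["".join(random.choice(string.ascii_lowercase) for _ in range(300))
        for _ in range(100000)]
"""
print(timeit.timeit("Counter(data)", setup=setup, number=5))
```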

I am rather new to Python and haven't got all the bits in place yet, so I'd appreciate it if you keep that in mind in your answers.

  • Why not dump the whole thing into a relational database and let an index do the job? – Daniele Bernardini Mar 09 '16 at 10:50
  • @DanieleBernardini because this analysis will be part of a pipeline that'll run on a cloud service. I don't think it would be feasible to create a new db every time somebody runs the analysis – posdef Mar 09 '16 at 11:31
  • This seems like a good use for some non-relational database options, something like Hadoop and Map-Reduce. – goodguy5 Mar 10 '16 at 19:45
