
I have a few Python scripts where I store 5-10 million string key-value pairs in a dictionary, and I query this dictionary around 5-10 million times. I noticed that the Python dict is not performing very well. Is there another implementation better suited for string keys?

Edit:

I have two large lists of person names and I want to match them, so I take one of them as the reference list and apply different heuristics to each name in the second list to figure out whether it exists in the first list. This means I have to query the first list 2-3 times for every name in the second list. I hope this makes sense.
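A minimal sketch of the pattern, with hypothetical sample data and heuristics:

    # Two small sample lists standing in for the real data
    reference_names = ["John Smith", "Jane Doe", "Alice Johnson"]
    candidate_names = ["john smith ", "Bob Brown", "JANE DOE"]

    reference = set(reference_names)  # hash-based membership tests, O(1) on average

    def variants(name):
        """Hypothetical heuristics: yield alternative forms of a name."""
        yield name
        yield name.strip()
        yield name.strip().title()

    matches = [n for n in candidate_names
               if any(v in reference for v in variants(n))]
    print(matches)  # ['john smith ', 'JANE DOE']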

Boolean
  • Why aren't you using a database? – Geo Mar 30 '11 at 07:30
  • Database doesn't make any sense. – Boolean Mar 30 '11 at 07:34
  • I find it hard to believe, then, that dictionary lookups are the bottleneck. Python dictionaries are fast, and also have optimisations for the case where all keys are strings. Are you certain that the time isn't being taken 'applying different heuristics'? Have you benchmarked with and without the dictionary lookups? – Duncan Mar 30 '11 at 07:52
  • I am not sure how to benchmark it. I have used a profiling program, but it gives me only the time spent inside each method call and the total number of method calls, and my code is just plain code without any method calls. – Boolean Mar 30 '11 at 07:55
  • `from time import time / t0 = time() / ...suspected code... / print time()-t0` – Adam Fraser Mar 31 '11 at 16:11
  • @Adam: Why use `time()` when you have cProfile... – kennytm Mar 31 '11 at 17:43
  • Not familiar with cProfile. `time()` is short and sweet though – Adam Fraser Mar 31 '11 at 17:55
  • Could you explain what you mean by "is not performing very well"? – gurney alex Apr 04 '11 at 08:14
  • What are the values of your dict? If they are meaningless, use a set instead of a dict. – gurney alex Apr 04 '11 at 08:15
  • You might be able to get better results by using the trie pattern instead of a normal dictionary, but it's tough to say what you should do without a bit more detail regarding what you're doing. – monokrome Mar 30 '11 at 07:25
  • Is there a good implementation of trie in python? – Boolean Mar 30 '11 at 07:28
  • This is probably a separate question, and I haven't ever needed to answer it before... However, searching this site for questions involving "Python" and "Trie" brings up this interesting thread: http://stackoverflow.com/questions/2406416/implementing-a-patricia-trie-for-use-as-a-dictionary – monokrome Mar 30 '11 at 07:31
  • By the way, you will get much better results from using a trie. I'm just not sure how much better. – monokrome Mar 30 '11 at 07:34
  • i suspect you will only get better results if the trie is implemented in a c library that python calls. using a pure-python trie implementation is unlikely to help. (i'm afraid i don't know of one, sorry). – andrew cooke Mar 30 '11 at 11:26
  • I believe BioPython has the best Trie implementation for Python. It has efficient prefix matching. – Andreas Mar 31 '11 at 14:41
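For reference, a minimal pure-Python trie with exact-match lookup (a toy sketch; as the comments above note, a C-backed implementation is probably needed for a real speedup over dict):

    class Trie(object):
        """A toy trie supporting insertion and exact-match membership tests."""
        def __init__(self):
            self.root = {}

        def insert(self, word):
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node['$'] = True  # end-of-word marker

        def __contains__(self, word):
            node = self.root
            for ch in word:
                if ch not in node:
                    return False
                node = node[ch]
            return '$' in node

    t = Trie()
    t.insert("jane doe")
    print("jane doe" in t)  # True
    print("jane" in t)      # False -- only a prefix was inserted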

6 Answers


Wow. A hashmap (dictionary) might not be the structure you're looking for.

Instead of using strings, try a representation that can give you good and fast hashing. Or are you really storing strings? If so, scratch the "might" in the previous sentence.

Could you give us details about the problem you're tackling?

salezica

Questions: Is this a scaling issue? Are you finding that the code runs more than twice as slowly when you have twice as much data? Is it possible that you are running out of physical memory and swapping?

10 million strings of 100 characters each is a gigabyte. If you have 2 sets of those, that would be 2 gigabytes, which is close to the limit of a 32-bit WinXP process.

If you don't already know the answer to this question, I would recommend running a test with the dictionary at various sizes (powers of 10 or 2) and seeing whether the performance curve has a discontinuity.
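A rough sketch of such a test (the key format and sizes are arbitrary):

    import time

    for exponent in range(4, 8):  # 10**4 .. 10**7 entries
        n = 10 ** exponent
        d = dict(("key%d" % i, i) for i in range(n))
        t0 = time.time()
        for i in range(n):
            d["key%d" % i]
        elapsed = time.time() - t0
        print("%8d entries: %.3f s total, %.2e s per lookup" % (n, elapsed, elapsed / n))

If the per-lookup time jumps sharply at some size, memory pressure (swapping) is a likely culprit.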

GeePokey

As Santiago Lezica said, a dictionary is not the structure you're looking for.

Maybe you should try Redis: http://redis.io. It's an advanced key-value store.

There is a library for Python here.
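A minimal sketch using the redis-py client, assuming a Redis server running on localhost; a Redis set fits the membership-test use case:

    import redis

    r = redis.Redis(host='localhost', port=6379, db=0)

    # Load the reference names once; Redis keeps them server-side
    r.sadd('names', 'john smith', 'jane doe')

    # Membership test per candidate name
    print(r.sismember('names', 'jane doe'))  # True

Note that every lookup becomes a network round trip, so this mainly pays off when the data no longer fits comfortably in memory.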

Sandro Munda

PyTables (http://www.pytables.org/moin) is made to store large datasets. In your case, one dictionary = one table.
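A rough sketch of that idea with the PyTables API (the table layout and the 100-byte field size are assumptions):

    import tables

    class NamePair(tables.IsDescription):
        key = tables.StringCol(100)    # assuming names fit in 100 bytes
        value = tables.StringCol(100)

    h5 = tables.open_file('names.h5', mode='w')
    table = h5.create_table('/', 'pairs', NamePair)

    row = table.row
    row['key'] = 'jane doe'
    row['value'] = 'ref-42'
    row.append()
    table.flush()

    # In-kernel query for a single key
    print([r['value'] for r in table.where('key == b"jane doe"')])
    h5.close()

Bear in mind that a disk-backed table is slower per lookup than an in-memory dict; it mainly helps once the data outgrows RAM.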

Monkey

From your description it sounds like you might as well do:

    set(names1).intersection(set(names2))

Right?

Either way, it sounds like the problem is that your algorithm is slow, rather than the implementation of Python's hashtables.
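If the heuristics can be expressed as a normalisation function, the whole match reduces to a single set intersection (the normalise function here is a hypothetical stand-in):

    def normalise(name):
        """Hypothetical heuristic: canonicalise a name before comparison."""
        return ' '.join(name.lower().split())

    names1 = ["John Smith", "Jane  Doe"]
    names2 = ["john smith", "Bob Brown"]

    common = set(map(normalise, names1)) & set(map(normalise, names2))
    print(common)  # {'john smith'}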

Andreas

Even if you're not using classes or method calls, put your code into a function and call that function. Python's functions are highly optimized, partly because Python accesses local variables faster than global variables.
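A quick way to see the difference (absolute timings will vary by machine):

    import time

    d = dict((str(i), i) for i in range(10 ** 6))

    # Module-level loop: 'd' is looked up as a global on every iteration
    t0 = time.time()
    for i in range(10 ** 6):
        d[str(i)]
    print("global: %.3f s" % (time.time() - t0))

    def run(d):
        # Inside a function, 'd' and the loop variable are fast local lookups
        t0 = time.time()
        for i in range(10 ** 6):
            d[str(i)]
        print("local:  %.3f s" % (time.time() - t0))

    run(d)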

The Python Performance Tips writeup on the Python wiki is a great read on this topic.

jathanism