
I have a few Python scripts where I store 5-10 million string key-value pairs in a dictionary, and I query this dictionary around 5-10 million times. I noticed that the Python dict is not performing very well. Is there another implementation better suited for string keys?

Edit:

I have two large lists of person names and I want to match them, so I take one of them as the reference list and apply different heuristics to each name in the second list to figure out whether it exists in the first list. This means I have to query the first list 2-3 times for every name in the second list. I hope this makes sense.
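A minimal sketch of the pattern, with hypothetical sample data and heuristics:

    # Two small sample lists standing in for the real data
    reference_names = ["John Smith", "Jane Doe", "Alice Johnson"]
    candidate_names = ["john smith ", "Bob Brown", "JANE DOE"]

    reference = set(reference_names)  # hash-based membership tests, O(1) on average

    def variants(name):
        """Hypothetical heuristics: yield alternative forms of a name."""
        yield name
        yield name.strip()
        yield name.strip().title()

    matches = [n for n in candidate_names
               if any(v in reference for v in variants(n))]
    print(matches)  # ['john smith ', 'JANE DOE']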

Boolean
  • Why aren't you using a database? – Geo Mar 30 '11 at 07:30
  • Database doesn't make any sense. – Boolean Mar 30 '11 at 07:34
  • I find it hard to believe, then, that dictionary lookups are the bottleneck. Python dictionaries are fast, and also have optimisations for the case where all keys are strings. Are you certain that the time isn't being taken 'applying different heuristics'? Have you benchmarked with and without the dictionary lookups? – Duncan Mar 30 '11 at 07:52
  • I am not sure how to benchmark it. I have used a profiling program, but it gives me only the time spent inside each method call and the total number of method calls, and my code is just plain code without any method calls. – Boolean Mar 30 '11 at 07:55
  • `from time import time / t0 = time() / ...suspected code... / print time()-t0` – Adam Fraser Mar 31 '11 at 16:11
  • @Adam: Why use `time()` when you have cProfile... – kennytm Mar 31 '11 at 17:43
  • Not familiar with cProfile. `time()` is short and sweet though – Adam Fraser Mar 31 '11 at 17:55
  • Could you explain what you mean by "is not performing very well"? – gurney alex Apr 04 '11 at 08:14
  • What are the values of your dict? If they are meaningless, use a set instead of a dict. – gurney alex Apr 04 '11 at 08:15
  • You might be able to get better results by using the trie pattern instead of a normal dictionary, but it's tough to say what you should do without a bit more detail regarding what you're doing. – monokrome Mar 30 '11 at 07:25
  • Is there a good implementation of trie in python? – Boolean Mar 30 '11 at 07:28
  • This is probably a separate question, and I haven't ever needed to answer it before... However, searching this site for questions involving "Python" and "Trie" brings up this interesting thread: http://stackoverflow.com/questions/2406416/implementing-a-patricia-trie-for-use-as-a-dictionary – monokrome Mar 30 '11 at 07:31
  • By the way, you will get much better results from using a trie. I'm just not sure how much better. – monokrome Mar 30 '11 at 07:34
  • i suspect you will only get better results if the trie is implemented in a c library that python calls. using a pure-python trie implementation is unlikely to help. (i'm afraid i don't know of one, sorry). – andrew cooke Mar 30 '11 at 11:26
  • I believe BioPython has the best Trie implementation for Python. It has efficient prefix matching. – Andreas Mar 31 '11 at 14:41
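For reference, a minimal pure-Python trie with exact-match lookup (a toy sketch; as the comments above note, a C-backed implementation is probably needed for a real speedup over dict):

    class Trie(object):
        """A toy trie supporting insertion and exact-match membership tests."""
        def __init__(self):
            self.root = {}

        def insert(self, word):
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node['$'] = True  # end-of-word marker

        def __contains__(self, word):
            node = self.root
            for ch in word:
                if ch not in node:
                    return False
                node = node[ch]
            return '$' in node

    t = Trie()
    t.insert("jane doe")
    print("jane doe" in t)  # True
    print("jane" in t)      # False -- only a prefix was inserted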

6 Answers


Wow. A hashmap (dictionary) might not be the structure you're looking for.

Instead of using strings, try a representation that can give you good and fast hashing. Or are you really storing strings? If so, scratch the "might" in the previous sentence.

Could you give us details about the problem you're tackling?

salezica

Questions: Is this a scaling issue? Are you finding that the code runs more than twice as slowly when you have twice as much data? Is it possible that you are running out of physical memory and swapping?

10 million strings of 100 characters each is a gigabyte. If you have 2 sets of those, that would be 2 gigabytes, which is close to the limit of a 32-bit WinXP process.

If you don't already know the answer to this question, I would recommend running a test with the dictionary at various sizes (powers of 10 or 2) and seeing whether the performance curve has a discontinuity.
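A rough sketch of such a test (the key format and sizes are arbitrary):

    import time

    for exponent in range(4, 8):  # 10**4 .. 10**7 entries
        n = 10 ** exponent
        d = dict(("key%d" % i, i) for i in range(n))
        t0 = time.time()
        for i in range(n):
            d["key%d" % i]
        elapsed = time.time() - t0
        print("%8d entries: %.3f s total, %.2e s per lookup" % (n, elapsed, elapsed / n))

If the per-lookup time jumps sharply at some size, memory pressure (swapping) is a likely culprit.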

GeePokey

As Santiago Lezica said, a dictionary is not the structure you're looking for.

Maybe you should try Redis: http://redis.io. It's an advanced key-value store.

There is a library for Python here.
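A minimal sketch using the redis-py client, assuming a Redis server running on localhost; a Redis set fits the membership-test use case:

    import redis

    r = redis.Redis(host='localhost', port=6379, db=0)

    # Load the reference names once; Redis keeps them server-side
    r.sadd('names', 'john smith', 'jane doe')

    # Membership test per candidate name
    print(r.sismember('names', 'jane doe'))  # True

Note that every lookup becomes a network round trip, so this mainly pays off when the data no longer fits comfortably in memory.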

Sandro Munda

PyTables (http://www.pytables.org/moin) is made to store large datasets. In your case, one dictionary = one table.
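A rough sketch of that idea with the PyTables API (the table layout and the 100-byte field size are assumptions):

    import tables

    class NamePair(tables.IsDescription):
        key = tables.StringCol(100)    # assuming names fit in 100 bytes
        value = tables.StringCol(100)

    h5 = tables.open_file('names.h5', mode='w')
    table = h5.create_table('/', 'pairs', NamePair)

    row = table.row
    row['key'] = 'jane doe'
    row['value'] = 'ref-42'
    row.append()
    table.flush()

    # In-kernel query for a single key
    print([r['value'] for r in table.where('key == b"jane doe"')])
    h5.close()

Bear in mind that a disk-backed table is slower per lookup than an in-memory dict; it mainly helps once the data outgrows RAM.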

Monkey

From your description it sounds like you might as well do:

    set(names1).intersection(set(names2))

Right?

Either way, it sounds like the problem is that your algorithm is slow, rather than the implementation of Python's hashtables.
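If the heuristics can be expressed as a normalisation function, the whole match reduces to a single set intersection (the normalise function here is a hypothetical stand-in):

    def normalise(name):
        """Hypothetical heuristic: canonicalise a name before comparison."""
        return ' '.join(name.lower().split())

    names1 = ["John Smith", "Jane  Doe"]
    names2 = ["john smith", "Bob Brown"]

    common = set(map(normalise, names1)) & set(map(normalise, names2))
    print(common)  # {'john smith'}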

Andreas

Even if you're not using classes or method calls, put your code into a function and call that function. Python's functions are highly optimized, partly because Python accesses local variables faster than global variables.
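A quick way to see the difference (absolute timings will vary by machine):

    import time

    d = dict((str(i), i) for i in range(10 ** 6))

    # Module-level loop: 'd' is looked up as a global on every iteration
    t0 = time.time()
    for i in range(10 ** 6):
        d[str(i)]
    print("global: %.3f s" % (time.time() - t0))

    def run(d):
        # Inside a function, 'd' and the loop variable are fast local lookups
        t0 = time.time()
        for i in range(10 ** 6):
            d[str(i)]
        print("local:  %.3f s" % (time.time() - t0))

    run(d)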

The Python Performance Tips writeup on the Python wiki is a great read on this topic.

jathanism