51

I need a faster way to store and access around 3GB of k:v pairs, where k is a string or an integer and v is an np.array that can have different shapes.

Is there any object that is faster than the standard Python dict at storing and accessing such a table? For example, a pandas.DataFrame?

As far as I understand, the Python dict is a quite fast implementation of a hashtable. Is there anything better than that for my specific case?
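The data looks something like this (the keys and shapes here are made up for illustration):

import numpy as np

# ~3GB in total; keys are strings or ints, values are arrays of varying shapes
table = {
    'seq_1': np.random.rand(100, 4),
    'seq_2': np.random.rand(300),
    42: np.random.rand(7, 7, 3),
}

value = table['seq_1']  # this access is what I need to be as fast as possible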

alec_djinn
  • If you're using Python 3.5 or lower, [the dictionary built into Python 3.6 is said to be 20-25% faster than the dictionary in Python 3.5](http://stackoverflow.com/questions/39980323/dictionaries-are-implemented-as-ordered-in-cpython-3-6). So you may get better performance using the latest stable version of Python. – Christian Dean Nov 19 '16 at 15:20
  • Unless your code doesn't do anything else, I'd be quite surprised if dictionary access was your bottleneck. Do you have profiling information showing this is the problem? – DSM Nov 19 '16 at 15:24
  • I think dicts are pretty fast. Instead of looking for an alternative, consider optimizing the rest of your code :) – Moinuddin Quadri Nov 19 '16 at 15:27
  • If your use case involved swapping -- if your data structure were larger than available RAM -- then there would be better answers, but it's not clear whether that's the case. – Charles Duffy Nov 19 '16 at 15:32
  • @leaf that's in memory requirements, I'm waiting for the release of `3.6` in December to add some timing for execution speed (and it *might* actually introduce a slight slowdown if my understanding is correct). – Dimitris Fasarakis Hilliard Nov 19 '16 at 16:41
  • @DSM my code only loops over the dict lots of times – alec_djinn Nov 19 '16 at 16:52
  • @leaf it's for ordered dict only... – alec_djinn Nov 19 '16 at 16:58
  • Do you need the looping to be faster? What exactly are you doing with the data when you iterate through it? There's no context to this question; *"Is there anything faster than the `dict()`?"* in doing what exactly? – Dimitris Fasarakis Hilliard Nov 19 '16 at 17:02
  • @JimFasarakis-Hilliard in accessing the k:v pairs. Is it the best object for the task? Is there something faster, more memory efficient etc... Of course, the goal is to loop over it as fast as possible. – alec_djinn Nov 19 '16 at 17:09
  • @alec_djinn: if your code only loops over the dict, it's easy to make it faster -- remove the loop! But if your code does something *inside* the loop (say printing, or finding the maximum of the values, or anything other than `pass`), then if that takes longer than the dictionary access (and it almost certainly will), improving dict access won't improve your net performance at all. At this point, you're going to have to show some code if you want any real advice. – DSM Nov 19 '16 at 17:16
  • Is there any way you might vectorize it? What operations are performed on the data? It's also worth pointing out that `DataFrames` are much easier and, I believe, faster to eventually store. – Dimitris Fasarakis Hilliard Nov 19 '16 at 17:16
  • @JimFasarakis-Hilliard `{'some_string':np.array(), ...}` how would you vectorize this? The arrays have different shapes. – alec_djinn Nov 19 '16 at 17:19
  • Why are you accessing the values? What do you do with them? Why not include your code so far? – Dimitris Fasarakis Hilliard Nov 19 '16 at 17:22
  • @DSM I am not assuming that looping over the dict is the bottleneck of my code. However, since my code needs to access that dictionary over and over, I am asking if there is anything better than the dict. I am not interested in other ways of speeding up my code. – alec_djinn Nov 19 '16 at 17:42

5 Answers

76

No, there is nothing faster than a dictionary for this task, and that's because the complexity of its indexing (getting and setting items) and even of membership checking is O(1) on average. (Check the complexity of the rest of its operations in the Python wiki: https://wiki.python.org/moin/TimeComplexity)

Once you have saved your items in a dictionary, you can access them in constant time, which means it's unlikely that your performance problem has anything to do with dictionary indexing. That said, you might still be able to make this process slightly faster by making some changes to your objects and their types that result in optimizations in the under-the-hood operations.

For example, if your strings (keys) are not very large, you can intern the lookup key and your dictionary's keys. Interning caches the objects in memory --or, as in Python, in a table of "interned" strings-- rather than creating each one as a separate object.

Python provides an intern() function in the sys module that you can use for this:

Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup...

also ...

If the keys in a dictionary are interned and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer comparison instead of comparing the string values themselves, which in consequence reduces the access time to the object.

Here is an example:

In [48]: import sys

In [49]: d = {'mystr{}'.format(i): i for i in range(30)}

In [50]: %timeit d['mystr25']
10000000 loops, best of 3: 46.9 ns per loop

In [51]: d = {sys.intern('mystr{}'.format(i)): i for i in range(30)}

In [52]: %timeit d['mystr25']
10000000 loops, best of 3: 38.8 ns per loop
Mazdak
  • While I think the OP is heading in the wrong direction, performance is about more than big-O. – DSM Nov 19 '16 at 15:41
  • That is the kind of trick I was looking for! It did speed up my code a few percent. Thanks! – alec_djinn Nov 19 '16 at 17:46
  • Nice answer, but I think the example would be clearer if the values were interned strings too, rather than integers (but it works I guess because integers are interned within this range as an implementation detail) – Chris_Rands Nov 24 '16 at 09:50
  • @Chris_Rands Indeed. They're smaller than 256 ;-) – Mazdak Nov 24 '16 at 09:52
  • The reason that nothing is faster than dict lookup is not that dict lookup is approx O(1). The reason is that dict lookup in Python is efficiently implemented in C, and has been optimized over the years. This does not exclude that there might be more efficient implementations. It also does not exclude that one might achieve better performance with other methods than hashing, especially if the keys are integers. – Dirk Roorda Mar 01 '21 at 06:56
7

No, I don't think there is anything faster than dict. The time complexity of its indexing is O(1) on average.

-------------------------------------------------------
Operation    |  Average Case  | Amortized Worst Case  |
-------------------------------------------------------
Copy[2]      |    O(n)        |       O(n)            | 
Get Item     |    O(1)        |       O(n)            | 
Set Item[1]  |    O(1)        |       O(n)            | 
Delete Item  |    O(1)        |       O(n)            | 
Iteration[2] |    O(n)        |       O(n)            | 
-------------------------------------------------------

Source: https://wiki.python.org/moin/TimeComplexity
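A quick sketch to see the constant-time behaviour for yourself (timings are machine-dependent, and the sizes here are arbitrary):

import timeit

# Lookup time stays roughly flat as the dict grows,
# which is what O(1) average-case indexing means.
for n in (10**2, 10**4, 10**6):
    d = {i: i for i in range(n)}
    t = timeit.timeit('d[n // 2]', globals={'d': d, 'n': n}, number=10**6)
    print(n, 'keys:', t, 'seconds for 1M lookups')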

akash karothiya
  • Note that "Faster" and "O-notation" are not the same. If you're talking wallclock time, there are definitely "faster maps" than a python dicts. There is a _whole_ world out there of hashmaps, in all shapes and colors. – TimZaman Jan 04 '23 at 21:54
5

A comparison of a numpy array and a plain dict:

import numpy
from timeit import default_timer as timer

my_array = numpy.ones([400, 400])

def read_out_array_values():
    # Sum all elements by indexing them one at a time
    cumsum = 0
    for i in range(400):
        for j in range(400):
            cumsum += my_array[i, j]

start = timer()
read_out_array_values()
end = timer()
print("Time for array calculations:" + str(end - start))

# Build a dict holding the same 400x400 values, keyed by (i, j) tuples
my_dict = {}
for i in range(400):
    for j in range(400):
        my_dict[i, j] = 1

def read_out_dict_values():
    # Sum all elements via individual dict lookups
    cumsum = 0
    for i in range(400):
        for j in range(400):
            cumsum += my_dict[i, j]

start = timer()
read_out_dict_values()
end = timer()
print("Time for dict calculations:" + str(end - start))

Prints:

Time for dict calculations:0.046898419999999996
Time for array calculations:0.07558204099999999
============= RESTART: C:/Users/user/Desktop/dict-vs-numpyarray.py =============
Time for array calculations:0.07849989000000002
Time for dict calculations:0.047769446000000104
poetyi
3

One would think that array indexing is faster than hash lookup.

So if we could store this data in a numpy array, and assume the keys are not strings but numbers, would that be faster than a Python dictionary?

Unfortunately not, because NumPy is optimized for vector operations, not for individual lookup of values. Pandas fares even worse. See the experiment here: https://nbviewer.jupyter.org/github/annotation/text-fabric/blob/master/test/pandas/pandas.ipynb
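The vectorization point can be seen in a small sketch (the sizes are arbitrary and the exact ratios depend on your machine): one vectorized call is fast, while the same work done as individual NumPy element lookups is much slower, and slower than the equivalent lookups on a plain dict.

import timeit
import numpy as np

n = 10**5
a = np.arange(n)
d = dict(enumerate(range(n)))

# One vectorized call over the whole array: fast.
print('vectorized sum:   ', timeit.timeit(lambda: a.sum(), number=10))

# The same sum as n individual element lookups: much slower.
print('element-wise numpy:', timeit.timeit(lambda: sum(a[i] for i in range(n)), number=10))
print('element-wise dict: ', timeit.timeit(lambda: sum(d[i] for i in range(n)), number=10))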

The other candidate could be the Python array, in the array module. But that is not usable for variable-size values. And in order to make this work, you would probably need to wrap it in some pure Python code, which would give back all the performance gains that the array offers.
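To illustrate, the pure-Python wrapping meant here might look like the following hypothetical sketch (one flat array plus per-key offsets); note that the offset index is itself a dict:

from array import array

# Hypothetical wrapper: pack variable-length float values into one
# flat array and keep (start, length) offsets per key in a dict.
flat = array('d')
offsets = {}

def store(key, values):
    offsets[key] = (len(flat), len(values))
    flat.extend(values)

def fetch(key):
    start, length = offsets[key]
    return flat[start:start + length]

store('a', [1.0, 2.0, 3.0])
store('b', [4.0])
print(fetch('a'))  # array('d', [1.0, 2.0, 3.0])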

So, even if the requirements of the OP are relaxed, there still does not seem to be a faster option than dictionaries.

Dirk Roorda
  • Thanks. You stole the question I had in mind. I had assumed that an nparray would be faster for a scenario where key can be the index – Allohvk Jul 06 '22 at 07:48
-2

You could think of storing them in a data structure like a trie, given that your keys are strings. Even storing and retrieving from a trie takes O(N), where N is the maximum key length. The same applies to the hash calculation, which computes a hash for the key; the hash is used to find and store entries in the hash table. We often don't consider the hashing time or computation.

You may give a trie a shot. It should be of almost equal performance, maybe a little faster if the hash value is computed differently, say as

HASH[i] = (HASH[i-1] + key[i-1]*256^i % BUCKET_SIZE) % BUCKET_SIZE

or something similar (due to collisions we need to use 256^i).

You can try storing them in a trie and see how it performs.
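If you want to benchmark this idea against a dict, here is a minimal trie sketch (the class and method names are made up for illustration):

class Trie:
    # Minimal trie mapping string keys to values.
    def __init__(self):
        self.children = {}
        self.value = None

    def set(self, key, value):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.value = value

    def get(self, key):
        node = self
        for ch in key:
            node = node.children[ch]  # raises KeyError for missing keys
        return node.value

t = Trie()
t.set('mystr25', 25)
print(t.get('mystr25'))  # 25

Note that in pure Python each character step is itself a dict lookup, so it is worth measuring before assuming any gain over a single dict access.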

sonus21
  • The first sentence is kinda misleading. The Big-O notation isn't about speed in general, it's about how the computation time changes when the size of the input changes. So, O(1) could be terribly slow, as well as O(n!). – ForceBru Nov 19 '16 at 15:44
  • true, but it doesn't mean it's gonna be _fast_ or _faster_ than O(anything) or even _the fastest_ among all the `O`s possible. It's about _how the amount of work changes when the input size changes_. – ForceBru Nov 19 '16 at 15:51
  • ...which is to say: In the real world, constant factors matter. An O(1) process that takes a constant-time 1000ms will often be a worse choice than an O(n) process that takes 1ns per item of input. – Charles Duffy Nov 19 '16 at 15:55
  • O(1) means "constant time", which means "whatever the input is, the computation time is the same". Sudoku solving *is* constant time if we just examine all possibilities, but it's way slower than many algorithms. – Cychih Nov 19 '16 at 15:59