
Imagine we have an array of unique integers. Given an integer (N) from that array, I want to be able to get its index (I) in the array as fast as possible.

My idea was to build an object that, given N, returns I. I thought of using either a structured array with data type (N, I) sorted by N, or simply a dictionary with keys N.

The search speed of both methods seems independent of the size of the object, which leads me to believe that it is dominated by overheads. However, I was a bit surprised to find that searching the dictionary was almost 10x faster than searching the structured array. So my questions are:

  1. Why is the dictionary so much faster than my array implementation?
  2. Is there an alternative method that is even faster than these two?

MWE:

import numpy as np
import timeit

# Time a function: NN repetitions, each averaging over NNN calls (prints ms per call)
def Timeme(funct, var, NN=10, NNN=10):
    for i in range(NN):
        start = timeit.default_timer()
        for t in range(NNN):
            funct(*var)
        end = timeit.default_timer()
        print(str(i) + ': ' + str((end - start) / NNN * 1000))

# Function to build a dictionary from (N, I) pairs (equivalent to dict(Flist) below)
def mydict(Flist):
    Mydict = dict()
    for n, i in Flist:
        Mydict[n] = i
    return Mydict

# Functions to access the data
def myfd(Mydict, vtest):
    # Dictionary lookup: a single indexing operation
    return Mydict[vtest]

def myfs(Flist, vtest):
    # Binary search on the sorted 'N' column, then fetch the matching 'I'
    n = Flist['N'].searchsorted(vtest)
    return Flist['I'][n]  # Flist[n]['I'] is slower

#N = 100000
N = 100

# Allocate empty structured array
Flist = np.empty(N, dtype=[('N', 'i4'), ('I', 'i4')])

# Fill N with unique randoms and I with a sequence
# (choice with replace=False guarantees uniqueness, matching the premise above)
Flist['N'] = np.random.choice(N * 1000, size=N, replace=False)
Flist['I'] = np.arange(N)

# "Create test value"
ntest=np.random.randint(N)
vtest=Flist['N'][ntest]

# "Sort array on N"
Flist.sort(order='N')

# "Make dictionary"
Mydict=dict(Flist)

# "Get values"    
nrd=myfd(Mydict,vtest)
nrs=myfs(Flist,vtest)

print "Tests OK: " + str(ntest == nrd and ntest == nrs) 

print "\nSearch with Dictionary:"
Timeme(myfd,[Mydict,vtest],NN=5,NNN=100)
print "\nSearch directly in Array:"
Timeme(myfs,[Flist,vtest],NN=5,NNN=100)

Result:

Tests OK: True

Search with Dictionary:
0: 0.000404204885682
1: 0.000409016848607
2: 0.000418640774457
3: 0.000404204885682
4: 0.000394580959833

Search directly in Array:
0: 0.00455211692685
1: 0.00465798011119
2: 0.00458580066732
3: 0.00464354422242
4: 0.00476384329554
Miguel
  • Why are you using structured array instead of just a classic flat one? – sascha Oct 19 '17 at 17:45
  • When I tested using a flat array it didn't change the speed. If you can think of a way to speedup the process by using a flat array please let me know! – Miguel Oct 19 '17 at 18:43
  • @Miguel again, there is *no way* you will implement faster look-ups than a `dict`. The major trade-off you are making is one of space, though. – juanpa.arrivillaga Oct 19 '17 at 18:44

1 Answer

This can partially be explained by method-call / function-call overhead. Your dictionary-search function merely does a single operation, indexing, which translates into one call to Mydict.__getitem__(key), whereas your array-based implementation ultimately calls 3 methods: .searchsorted once and __getitem__ twice. Python is a dynamic language, and function calls, and especially method calls (because of method resolution), are expensive.
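
To illustrate how much of the gap is call overhead, here is a minimal sketch (the setup mirrors the MWE above; exact timings are machine-dependent) that binds the column lookups once, removing repeated method calls from the array path:

import timeit
import numpy as np

N = 100
Flist = np.empty(N, dtype=[('N', 'i4'), ('I', 'i4')])
Flist['N'] = np.random.choice(N * 1000, size=N, replace=False)
Flist['I'] = np.arange(N)
Flist.sort(order='N')
Mydict = dict(Flist)
vtest = Flist['N'][N // 2]

# Bind once, outside the hot loop, to avoid repeated attribute/indexing calls
searchsorted_N = Flist['N'].searchsorted
col_I = Flist['I']

print(timeit.timeit(lambda: Mydict[vtest], number=100000))                # one call per lookup
print(timeit.timeit(lambda: col_I[searchsorted_N(vtest)], number=100000)) # two calls per lookup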

But fundamentally, your dict-based implementation should scale better. Python dict objects are highly optimized hash maps with, usually, constant-time lookup. Your array-based implementation is a binary search, so it is O(log(n)). You'll see this with a test case where you select the worst case, i.e., searching for an element not in the array. Given that searchsorted scales logarithmically, you might have to increase the size of your array dramatically (e.g. 100x, 1000x) before you see notable runtime effects.
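
A quick way to see the different scaling is to repeat the lookup timing at several sizes; a minimal sketch (the sizes and key construction are my own choices, not from the question):

import timeit
import numpy as np

for n in (10**3, 10**5, 10**6):
    keys = np.arange(n)                      # unique keys, already sorted
    lookup = {int(k): i for i, k in enumerate(keys)}
    vtest = n - 1                            # a deep binary-search path

    t_dict = timeit.timeit(lambda: lookup[vtest], number=100000)
    t_bs = timeit.timeit(lambda: keys.searchsorted(vtest), number=100000)
    # t_dict should stay roughly flat; t_bs grows with log(n)
    print(n, t_dict, t_bs)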

There is absolutely no chance you'll implement a faster look-up than the built-in dict in Python.

juanpa.arrivillaga
  • Thank you for the answer. It seems to me that, in my humble opinion, the main disadvantages of the dictionary are 1) Cannot specify the size of the data; 2) Dictionary is slower to build when compared to sorting the array; and 3) The array approach is vectorized. Am I mistaken about any of these disadvantages and/or is there a way to circumvent them? – Miguel Oct 19 '17 at 18:56
  • @Miguel what do you mean by "cannot specify the size of the data"? But with regard to 2), yes, building a dictionary from an array is slower than building another array. With regard to 3), yes, arrays support vectorized operations, but you don't leverage any here... so I'm not sure why that is an advantage... – juanpa.arrivillaga Oct 19 '17 at 19:01
  • With "cannot specify the size of the data" I think I'm referring to what you mentioned in a comment of the main question: "The major trade-off you are making is one of space". Regarding 2) I would just note that we should compare dict vs array + sort, which is why dict is slower but not several orders of magnitude slower. 3) I used vectorized in my application, I just wanted to have a simple(ish) MWE. – Miguel Oct 19 '17 at 19:47
  • @Miguel what I was referring to is that a dictionary is going to be *much* larger than an equivalent structured numpy-array. At least, if your array is int,int. – juanpa.arrivillaga Oct 19 '17 at 19:51
  • Exactly. By specifying the size of the data I can get a much smaller object. I imagine that if someone wanted to optimize the dict function they could allow the user to specify the type of data in the keys and in the values. As far as I know I don't think this is possible. – Miguel Oct 19 '17 at 21:57
  • @Miguel maybe we mean the same thing, but what I'm saying is the size *of the dict* isn't the problem with Python `dict`, it is a general problem with Python objects. The dict actually is a hashmap, which is essentially an array, but the real problem is that it is an array of PyObject *pointers*, which point to full-fledged Python objects. Note, `import sys; sys.getsizeof(1)` and you'll see what I mean. With numpy, you can actually create primitive arrays, that are essentially wrapped in an object-oriented interface. – juanpa.arrivillaga Oct 19 '17 at 22:04
  • @Miguel so, check out [this](https://stackoverflow.com/questions/43404210/how-much-memory-will-a-list-with-one-million-elements-take-up-in-python/43404344#43404344) – juanpa.arrivillaga Oct 19 '17 at 22:06
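
To put rough numbers on the space trade-off discussed in these comments, a minimal sketch (exact figures vary by platform and Python version; note that sys.getsizeof on a dict counts only its table of pointers, not the key/value objects they reference):

import sys
import numpy as np

N = 100000
Flist = np.empty(N, dtype=[('N', 'i4'), ('I', 'i4')])
Flist['N'] = np.random.choice(N * 1000, size=N, replace=False)
Flist['I'] = np.arange(N)
Mydict = dict(Flist)

print(Flist.nbytes)            # 8 bytes per (i4, i4) record: 800000
print(sys.getsizeof(Mydict))   # the hash table alone, already several MB
print(sys.getsizeof(1))        # one full-fledged Python int, per the comment above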