
While learning how to build a hashmap with Numpy and Python 3, I came up with the code below, which stores the key-value pairs in a Numpy structured array data.

However, looking up the value for a key is quite slow: the timeit runs below take about 13.3 secs for the Numpy structured array data versus about 0.0008 secs for the Python dictionary d.

val = data[data['keys'] == key]['values'][0]

Is there a faster way to get the item for a particular key?

import numpy as np
import timeit

N = 1000*1000
keyArr = np.random.randint(0, 1000*1000*1000*4, N)
valArr = np.random.rand(N)
key = keyArr[0]                                     # Select an existing key value

# Numpy structured array
data = np.empty(keyArr.shape[0], dtype=[('keys', keyArr.dtype), ('values', valArr.dtype)])
data['keys'] = keyArr
data['values'] = valArr

val = data[data['keys'] == key]['values'][0]
print(key, '=>', val)                               # 558520981 => 0.17948995177905835
print( timeit.Timer("data[data['keys'] == key]['values'][0]", 
    globals=globals()).timeit(10*1000) , 'secs' )   # 13.256318201000001 secs

# Python built-in dictionary
d = {}
for k, v in zip(keyArr, valArr):
    d[k] = v

print(key, '=>', d[key])                            # 558520981 => 0.17948995177905835
print( timeit.Timer("d[key]",       
    globals=globals()).timeit(10*1000) , 'secs' )   # 0.0008061910000000116 secs

Numpy Lookup Method 1: 13.3 secs

val = data[data['keys'] == key]['values'][0]

Numpy Lookup Method 2: 13.4 secs

val = data['values'][np.where(data['keys'] == key)][0]

pandas.Series: 6.8 secs

import pandas as pd

# Pandas Series
s = pd.Series(valArr, index=keyArr, dtype=valArr.dtype)
val = s[key]
print(key, '=>', val)
print( timeit.Timer("s[key]", 
    globals=globals()).timeit(10*1000) , 'secs' )   # 6.782590246000002 secs
Athena Wisdom
  • have you looked into pandas? – RichieV Sep 04 '20 at 18:28
  • @RichieV No, I assumed `pandas` will be slower than `numpy` since it is built on top of `numpy` and with additional features. – Athena Wisdom Sep 04 '20 at 18:43
  • 1
    It is definitely slower on normal numpy operations, but it is optimized for this kind of methods, like filtering with row/column labels – RichieV Sep 04 '20 at 18:45
  • @RichieV Good point, let me give pandas a try and update the question with my findings. I'm guessing `pandas.Series` will be suitable. – Athena Wisdom Sep 04 '20 at 18:48
  • @RichieV Updated the question with the results when using `pandas`. The `pandas.Series` has a 2X faster lookup than the numpy structured array, but it is still about 1000X slower than Python's dictionary – Athena Wisdom Sep 04 '20 at 18:55
  • The pandas series (55.3 MB) also takes up more space than the Python dictionary (40 MB). The numpy structured array takes up the least memory (15.3 MB). – Athena Wisdom Sep 04 '20 at 18:58
  • That was expected, read this question about [large data workflows using pandas](https://stackoverflow.com/q/14262433/6692898) – RichieV Sep 04 '20 at 19:23

1 Answer


The main source of the problem is that lookups like these in numpy and pandas scan every element of the array, because they are designed to also support boolean masks, multiple matches, and more complex selections. A Python dictionary, on the other hand, only does single-key lookups, but it is implemented as a hash table: the key is hashed once and the value is found in (average) constant time, instead of touching all N elements.
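To make this concrete, here is a minimal sketch (reusing data, key and d from the question's code) of what each lookup actually does: the mask comparison touches all N keys and allocates an N-element boolean array, while the dictionary hashes the key once.

# Structured-array lookup: the comparison visits every key and builds
# a full boolean mask, so the cost grows linearly with N.
mask = data['keys'] == key         # O(N) scan, N-element bool array
val = data['values'][mask][0]      # a second pass applies the mask

# Dictionary lookup: the key is hashed once and the value is found in
# (average) constant time, independent of N.
val = d[key]                       # average O(1) hash-table probe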

So, if your intention is to stick to key access, I don't think you'll find anything faster than a dictionary. Otherwise, I'd put my bet on pandas for the fastest access times.
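If you want to stay within numpy anyway (for example, for its lower memory footprint), one possible middle ground is binary search. This is only a sketch, assuming the keys are unique and can be sorted once up front; it reuses keyArr, valArr and key from the question and replaces the O(N) scan with an O(log N) np.searchsorted per lookup, though for pure single-key access I would still expect the dictionary to win.

import numpy as np

# One-time preprocessing: sort the keys and reorder the values to match.
order = np.argsort(keyArr)
sortedKeys = keyArr[order]
sortedVals = valArr[order]

# Per lookup: binary search instead of a full scan.
i = np.searchsorted(sortedKeys, key)              # O(log N)
if i < len(sortedKeys) and sortedKeys[i] == key:  # confirm an exact match
    val = sortedVals[i]
    print(key, '=>', val)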

David