
I am using a dictionary in Python to store (row, col) coordinates from a large NumPy array. The dictionary holds almost 100,000 entries.

Each key is a (row, col) tuple. Some sample entries from this structure:

OrderedDict([((1783, 586), 0), ((1783, 587), 1), ((1783, 588), 2), ((1783, 589), 3), ((1783, 590), 4), ((1784, 584), 5), ((1784, 585), 6), ((1784, 586), 7), ((1784, 587), 8), ((1784, 588), 9), ((1784, 589), 10), ((1784, 590), 11), ((1784, 591), 12), ((1784, 592), 13), ((1784, 593), 14), ((1784, 594), 15), ((1784, 595), 16), ((1785, 583), 17), ((1785, 584), 18), ((1785, 585), 19), ((1785, 586), 20), ((1785, 587), 21), ((1785, 588), 22), ((1785, 589), 23), ((1785, 590), 24), ((1785, 591), 25), ((1785, 592), 26), ((1785, 593), 27), ((1785, 594), 28), ((1785, 595), 29), ((1785, 596), 30), ((1785, 597), 31),...
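
For reference, a minimal sketch of how a mapping like this could be built; the mask shape and coordinates below are invented for illustration:

import numpy as np
from collections import OrderedDict

# Hypothetical setup: a boolean mask marking pixels of interest
# in a large array (shape and coordinates invented for the example).
mask = np.zeros((2000, 2000), dtype=bool)
mask[1783:1786, 583:598] = True

# Map each (row, col) of a set pixel to its running index.
rows, cols = np.nonzero(mask)
keyed_var_pixels = OrderedDict(
    ((int(r), int(c)), i) for i, (r, c) in enumerate(zip(rows, cols))
)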

Processing is taking forever because of these key lookups.

I perform a lookup using (row, col):

if (1783, 586) in keyed_var_pixels:

Based on this post, membership tests with the `in` keyword on a dict should use hashing. Each lookup seems to take around 0.02 seconds, which adds up to about 30 minutes over the entire dataset. That seems far too long for a hashed retrieval. How can I improve this runtime? Or is there an alternative data structure that gives fast retrieval and existence checks?
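
For scale, a synthetic timing sketch (the dictionary below is invented, sized to roughly match the ~100,000 entries) shows what a single tuple-key lookup costs in isolation:

import timeit

# Build a synthetic dict of 100,000 (row, col) tuple keys.
setup = (
    "d = {(r, c): i for i, (r, c) in enumerate("
    "(r, c) for r in range(1000) for c in range(100))}"
)
# Time one membership test, repeated a million times.
t = timeit.timeit("(500, 50) in d", setup=setup, number=1_000_000)
print(f"{t / 1_000_000 * 1e9:.0f} ns per lookup")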

Thanks in advance!

Neo_32
  • maybe add a column to the numpy array and store the value there? – Nicolas Martinez Mar 15 '20 at 02:03
  • This benchmark information should help you understand better: https://stackoverflow.com/a/28860508/3218127 – rkatkam Mar 15 '20 at 02:05
  • can you provide a reproducible example so we can compare the execution time of each algorithm? – marcos Mar 15 '20 at 02:13
  • How often does your `if (...) in keyed_var_pixels` return false? If that is reasonably low, you can simply try a `try`/`except` construct instead of an `if` (following the ask-for-forgiveness-rather-than-permission principle); see the sketch after these comments. – coffeinjunky Mar 15 '20 at 02:15
  • Can you show the code that takes 30 minutes, not just a single lookup? What are you trying to achieve? Why did you not do this using operations on the original NumPy array? It is supposed to be fast. – mkrieger1 Mar 15 '20 at 02:21
  • No way that takes 0.02 seconds. – Kelly Bundy Mar 15 '20 at 02:34
  • @NicolasMartinez Thank you. I didn't think about that. I am not too comfortable with numpy, so was going the regular data structure route. – Neo_32 Mar 15 '20 at 03:16
  • @rkatkam Thank you, that's helpful. – Neo_32 Mar 15 '20 at 03:17
  • I think there must be some issue with my code. When I tried creating a random variable with a tuple and performed a performance test, it improved significantly, to the range of 20-30 microseconds. I will analyze it further and update the question. Thank you very much for your help, @Marcos, coffeinjunky, mkrieger1, Heap Overflow – Neo_32 Mar 15 '20 at 03:22
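
A minimal sketch of the `try`/`except` variant suggested by coffeinjunky, assuming `keyed_var_pixels` from the question:

# EAFP: attempt the lookup directly and handle the miss,
# instead of testing membership first.
try:
    idx = keyed_var_pixels[(1783, 586)]
    # ... use idx ...
except KeyError:
    pass  # key absent; skip this pixel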

1 Answer


I did a few performance tests a while ago, and a two-level dictionary is faster than using a tuple as a key:

d = {1783: {586: 0, 587: 1, ...}, ...}

if 1783 in d and 586 in d[1783]:
    # ...

or you can define an empty default and do it like this:

notFound = dict()
# ... 
if 586 in d.get(1783, notFound):
    # ...

or this:

value = d.get(1783, notFound).get(586, None)
if value is not None:
    # ...
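
A rough way to check the nested-vs-tuple claim on your own machine; the sizes and keys below are invented, using `timeit` from the standard library:

import timeit

# Two synthetic ~90,000-entry mappings over the same keys.
setup = """
flat = {(r, c): r * 1000 + c for r in range(300) for c in range(300)}
nested = {r: {c: r * 1000 + c for c in range(300)} for r in range(300)}
"""
t_flat = timeit.timeit("(150, 150) in flat", setup=setup, number=1_000_000)
t_nested = timeit.timeit("150 in nested and 150 in nested[150]",
                         setup=setup, number=1_000_000)
print(f"tuple key : {t_flat:.3f} s per million lookups")
print(f"two-level : {t_nested:.3f} s per million lookups")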
Alain T.