1

I have working code that performs a nested dictionary lookup and appends the results of another lookup to each key's list, using the indices returned by numpy's nonzero function. Basically, I need a list of strings appended to a dictionary. These strings and the dictionary's keys are hashed at one point to integers, and tracked using separate dictionaries with the integer hash as the key and the string as the value. I need to look up these hashed values and store the string results in the dictionary. It's confusing, so hopefully looking at the code helps. Here's a simplified version:

for key in ResultDictionary:
    ResultDictionary[key] = []

true_indices = np.nonzero(numpy_array_of_booleans)
for idx in range(0, len(true_indices[0])):
    ResultDictionary.get(HashDictA.get(true_indices[0][idx])).append(HashDictB.get(true_indices[1][idx]))

This code works for me, but I am hoping there's a way to improve the efficiency. I am not sure if I'm limited due to the nested lookup. The speed is also dependent on the number of true results returned by the nonzero function. Any thoughts on this? Appreciate any suggestions.

yeah3x
  • 23
  • 1
  • 4

2 Answers

0

You can't do much about the dictionary lookups - you have to do those one at a time.

You can clean up the array indexing a bit:

idxes = np.argwhere(numpy_array_of_booleans)
for i, j in idxes:
    ResultDictionary.get(HashDictA.get(i)).append(HashDictB.get(j))

argwhere is transpose(nonzero(...)), turning the tuple of arrays into a (n,2) array of index pairs. I don't think this makes a difference in speed, but the code is cleaner.
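To see that equivalence concretely, here is a quick sketch on a made-up toy mask:

```python
import numpy as np

# A tiny boolean mask to illustrate the two indexing styles.
mask = np.array([[False, True, False],
                 [True,  False, True]])

# nonzero returns a tuple of per-axis index arrays...
rows, cols = np.nonzero(mask)

# ...while argwhere returns the same indices as an (n, 2) array of pairs.
pairs = np.argwhere(mask)

print(rows)   # [0 1 1]
print(cols)   # [1 0 2]
print(pairs)  # [[0 1] [1 0] [1 2]]
```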

hpaulj
  • 221,503
  • 14
  • 230
  • 353
0

Here are two suggestions:

1) Since your hash dicts are keyed with ints, it might help to transform them into arrays (or even lists) for faster lookup, if that is an option:

k, v = map(list, (HashDictB.keys(), HashDictB.values()))
mxk, mxv = max(k), len(max(v, key=len))
lookupB = np.empty((mxk+1,), dtype=f'U{mxv}')
lookupB[k] = v
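A minimal sketch of the payoff, with made-up dict contents (note that `mxv` must be the *length* of the longest value, since the `U{n}` dtype takes a character count): unlike a dict, the resulting array can be indexed with a whole index array at once.

```python
import numpy as np

# Hypothetical stand-in for HashDictB: int hash -> string.
HashDictB = {0: 'foo', 2: 'barbaz', 3: 'qux'}

k, v = map(list, (HashDictB.keys(), HashDictB.values()))
mxk, mxv = max(k), len(max(v, key=len))     # largest key, longest value
lookupB = np.empty((mxk + 1,), dtype=f'U{mxv}')
lookupB[k] = v

# Vectorized lookup with an array of indices in one shot.
j = np.array([3, 0, 2])
print(lookupB[j])  # ['qux' 'foo' 'barbaz']
```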

2) You can probably save a number of lookups in ResultDictionary and HashDictA by processing your numpy_array_of_booleans row-wise:

i, j = np.where(numpy_array_of_booleans)
bnds, = np.where(np.r_[True, i[:-1] != i[1:], True])
ResultDict = {HashDictA[i[l]]: [HashDictB[jj] for jj in j[l:r]]
              for l, r in zip(bnds[:-1], bnds[1:])}
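To make the boundary trick visible, here is the same comprehension run on toy inputs (the mask and dict contents below are made up): `bnds` marks where the sorted row indices change, so each `(l, r)` slice covers one row's worth of hits.

```python
import numpy as np

# Hypothetical toy inputs: a 2x3 boolean mask plus the two hash dicts.
numpy_array_of_booleans = np.array([[True, False, True],
                                    [False, True, False]])
HashDictA = {0: 'rowA', 1: 'rowB'}
HashDictB = {0: 'x', 1: 'y', 2: 'z'}

i, j = np.where(numpy_array_of_booleans)
# i = [0 0 1]; boundaries fall where consecutive row indices differ.
bnds, = np.where(np.r_[True, i[:-1] != i[1:], True])   # [0 2 3]
ResultDict = {HashDictA[i[l]]: [HashDictB[jj] for jj in j[l:r]]
              for l, r in zip(bnds[:-1], bnds[1:])}
print(ResultDict)  # {'rowA': ['x', 'z'], 'rowB': ['y']}
```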

2b) If for some reason you need to incrementally add associations, you could do something like this (I'll shorten the variable names):

from operator import itemgetter

res = {}

def add_batch(data, res, hA, hB):
    i, j = np.where(data)
    # boundaries between runs of equal row indices
    bnds, = np.where(np.r_[True, i[:-1] != i[1:], True])
    for l, r in zip(bnds[:-1], bnds[1:]):
        if l+1 == r:
            # single column index: itemgetter would return a scalar, so use .add
            res.setdefault(hA[i[l]], set()).add(hB[j[l]])
        else:
            res.setdefault(hA[i[l]], set()).update(itemgetter(*j[l:r])(hB))
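A sketch of how this behaves across two batches, on made-up toy data: sets merge the new associations and silently drop repeats.

```python
import numpy as np
from operator import itemgetter

def add_batch(data, res, hA, hB):
    i, j = np.where(data)
    bnds, = np.where(np.r_[True, i[:-1] != i[1:], True])
    for l, r in zip(bnds[:-1], bnds[1:]):
        if l+1 == r:
            res.setdefault(hA[i[l]], set()).add(hB[j[l]])
        else:
            res.setdefault(hA[i[l]], set()).update(itemgetter(*j[l:r])(hB))

# Hypothetical hash dicts and two incremental batches.
hA = {0: 'a', 1: 'b'}
hB = {0: 'x', 1: 'y', 2: 'z'}
res = {}

# First batch: row 0 hits columns 0 and 2, row 1 hits column 1.
add_batch(np.array([[True, False, True],
                    [False, True, False]]), res, hA, hB)
# Second batch: row 0 hits column 0 again (deduplicated) and column 1 (new).
add_batch(np.array([[True, True, False],
                    [False, False, False]]), res, hA, hB)

# res['a'] == {'x', 'y', 'z'}, res['b'] == {'y'}
```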
Paul Panzer
  • 51,835
  • 3
  • 54
  • 99
  • Solution 2 has given me an ~5x speedup on some initial testing which is awesome. However, I am running this code in a loop and not all of ResultDict's keys get used every loop. The keys not used in that loop get their value (the list of values) overwritten by None. Is there a simple way to preserve the unused keys' values, or do I need to do some work before the loop to save those off and then add them back to the dictionary after? Due to the dictionary comprehension I don't think I can use "ResultDict.get" instead of "ResultDict =". – yeah3x Feb 05 '18 at 21:41
  • Ok so edit window expired but... Edit: it seems ResultDict.update({HashDictA[i[l]]: ...}) did the trick somehow. Slightly slower than you posted but still faster than what I started with by a good amount! About 3x. – yeah3x Feb 05 '18 at 21:54
  • @yeah3x But if a previously used key of HashDictA reappears the old list of associated HashDictB entries will be discarded and replaced by the new one. Is that what you want? Probably not, right? I'd guess you'd want to combine the lists? – Paul Panzer Feb 05 '18 at 22:27
  • The solution works for my current implementation, but yeah a combination of the lists would be preferred. I would not want to add repeats, so would I have to look into maybe using a set for the value instead of a list? Checking if each new item is already in the list makes me think it would be an expensive operation but I do not have the most extensive Python background to say that for sure. – yeah3x Feb 06 '18 at 16:23
  • @yeah3x You're right, sets are a good solution for that. I've updated the post. – Paul Panzer Feb 06 '18 at 17:04
  • So I am getting an error about 'int' object is not iterable. After checking this answer ( https://stackoverflow.com/questions/28845284/add-vs-update-in-set-operations-in-python ) it seems set's update operation needs an iterable. So is there a way to fix cases where one of the sets is a single integer? Never used sets before. Would I need an if check before to see if I need an 'update' or 'add' operation? Would putting the iterable in a list be a fix? Not sure if that would then give me a set of a list? – yeah3x Feb 06 '18 at 18:51
  • So for my specific application I've been double checking and I do not think I need to get the set approach working. For anyone reading this, update([itemgetter(*j[l:r])(hB)]) makes even the single integer sets iterable, but causes multiple values in sets to be in a list. I've accepted your answer. Thanks so much for the help. – yeah3x Feb 06 '18 at 19:20
  • @yeah3x I think there is nothing better than a good old `if ... else`, I've updated the answer. – Paul Panzer Feb 07 '18 at 05:07