
I have a 2D Numpy array of integers like so:

a = np.array([[  3,   0,   2,  -1],
              [  1, 255,   1,   2],
              [  0,   3,   2,   2]])

and I have a dictionary with integer keys and values that I would like to use to replace the values of a with new values. The dict might look like this:

d = {0: 1, 1: 2, 2: 3, 3: 4, -1: 0, 255: 0}

I want to replace the values of a that match a key in d with the corresponding value in d. In other words, d defines a map between old (current) and new (desired) values in a. The outcome for the toy example above would be this:

a_new = np.array([[  4,   1,   3,   0],
                  [  2,   0,   2,   3],
                  [  1,   4,   3,   3]])

What would be an efficient way to implement this?

This is a toy example; in practice the array will be large, with a shape of e.g. (1024, 2048), and the dictionary will have on the order of dozens of elements (34 in my case). While the keys are integers, they are not necessarily all consecutive, and they can be negative (as in the example above).

I need to perform this replacement on hundreds of thousands of such arrays, so it needs to be fast. However, the dictionary is known in advance and remains constant, so asymptotically, any time used to modify the dictionary or transform it into a more appropriate data structure doesn't matter.

I'm currently looping over the array entries in two nested for loops (over the rows and columns of a), but there has got to be a better way.
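
For concreteness, a minimal sketch of that nested-loop version (not my exact code):

# Slow baseline: visit every entry and look it up in the dict.
a_new = a.copy()
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        if a[i, j] in d:
            a_new[i, j] = d[a[i, j]]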

If the map didn't contain negative values (e.g. -1 like in the example), I would just create a list or an array from the dictionary once where the keys are the array indices and then use that for an efficient Numpy fancy indexing routine. But since there are negative values, too, this won't work.
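
For concreteness, here is the fancy-indexing idea I have in mind, sketched with a hypothetical all-non-negative map d_pos:

d_pos = {0: 1, 1: 2, 2: 3, 3: 4}   # hypothetical map without negative keys
lut = np.arange(max(d_pos) + 1)    # identity lookup table
for old, new in d_pos.items():
    lut[old] = new
a_new = lut[a]   # only valid if every value of a is a non-negative index into lut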

Alex
  • I like this question a lot. Two thoughts: (1) replace the dict with a clever NumPy array as Andy suggests below (there are some other ways you could construct the indexer and/or run the raw data value through a function and then an indexer) or (2) consider using a Pandas Series/DataFrame, which has some nice replacer methods that may be fast enough; a sketch of the Pandas route appears after these comments. – MrDrFenner Oct 21 '17 at 23:21
  • Good point, I'll look into Pandas data structures! – Alex Oct 21 '17 at 23:29
  • Possible duplicate of [Fast replacement of values in a numpy array](https://stackoverflow.com/questions/3403973/fast-replacement-of-values-in-a-numpy-array) ... (found this after I answered). – wwii Oct 21 '17 at 23:35
  • @wwii I'm not super convinced by the numbers there. I think if it's a small dict, sure, but if it only has a few times more elements it's going to be much slower. Anyway, I think our two answers are the two solutions to try (and depending on your dict/data, one will be faster/better) :) – Andy Hayden Oct 21 '17 at 23:49
  • Will your dictionary have replacements for **all** the unique values in the array? Or are you only replacing a portion of the array values? – wwii Oct 22 '17 at 14:38
  • @wwii Not necessarily all values of the array need to be changed. The way I approached it so far, the dictionary will not necessarily contain a map for all unique array elements, but of course I could always add identity mappings (key = value) for all the unique array elements that should remain unchanged and have a "complete" dictionary if that is useful for some approach. One important thing is: I know the mapping in advance, so the dictionary is created once and then stays constant and is being used to process thousands of large 2D arrays. So the time for altering the dict doesn't matter. – Alex Oct 22 '17 at 15:16
  • There are solutions that are probably faster than others but would require `identity mappings (key = value) for all the unique array elements`. – wwii Oct 22 '17 at 15:22
  • Are the dictionary's keys supposed to cover all values in the input array? Vice versa, are all values in the array supposed to cover all keys in the dictionary? In other words, is there a unique one-to-one matching present between array and dictionary? – Divakar Oct 22 '17 at 16:35
  • @Divakar Originally my case was that the dictionary does not necessarily contain a map for all possible unique array values, but I realized that more efficient methods are possible if one can assume that the dictionary does contain a map for all possible unique array values, so the latter is the case now. The map is not necessarily injective though, i.e. the dictionary might map multiple distinct array values to the same new value. – Alex Oct 22 '17 at 21:24
  • @Alex Please check out my updated solution to leverage the one-to-one mapped case here - https://stackoverflow.com/a/46870227/ Should be pretty efficient. – Divakar Oct 23 '17 at 08:52
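
For reference, a minimal sketch of the Pandas route mentioned in MrDrFenner's comment above; `pandas.DataFrame.replace` does accept a dict of old-to-new values, but whether it is competitive speed-wise is untested here:

import pandas as pd

# Sketch: replace() maps old values to new ones wherever a key matches.
a_new = pd.DataFrame(a).replace(d).values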

5 Answers


Make a copy of the array, then iterate over the dictionary items and use boolean indexing to assign the new values to the copy. Values of a that have no key in d are left unchanged in the copy.

import numpy as np

b = np.copy(a)            # work on a copy so `a` stays intact
for old, new in d.items():
    b[a == old] = new     # boolean mask selects all occurrences of `old`
wwii

Here's one way. Provided the dictionary is small and the spread between the array's min and max values is modest, this may be more efficient: you work around the negative keys by offsetting the array by its minimum before indexing:

In [11]: indexer = np.array([d.get(i, -1) for i in range(a.min(), a.max() + 1)])

In [12]: indexer[(a - a.min())]
Out[12]:
array([[4, 1, 3, 0],
       [2, 0, 2, 3],
       [1, 4, 3, 3]])

Note: this moves the for loop into the construction of the lookup table; if the table is significantly smaller than the actual array, this can be a lot faster.
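
Since the dictionary is known in advance, you can also wrap the one-time construction in a small helper and reuse it for every array; a minimal sketch (the name make_mapper is mine, and it builds over the dict's key range, so it assumes every array value falls inside that range):

def make_mapper(d):
    lo, hi = min(d), max(d)   # key range, possibly negative
    indexer = np.array([d.get(i, i) for i in range(lo, hi + 1)])
    # assumes every value of arr lies in [lo, hi]
    return lambda arr: indexer[arr - lo]

mapper = make_mapper(d)   # build once
a_new = mapper(a)         # reuse for each of the many arrays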

Andy Hayden
  • This basically allows you to remove the condition "if the map didn't contain negative values" from the question. – Andy Hayden Oct 21 '17 at 23:14
  • Yeah that works! I guess complexity-wise this would be similar to looping over all dictionary items and doing `a[a == key] = value`? Somebody suggested that in another answer here but then strangely deleted it. – Alex Oct 21 '17 at 23:19
  • The good thing with your solution is that I need to create this indexer only once, so its complexity doesn't really matter, even if the dictionary were large (which it isn't). – Alex Oct 21 '17 at 23:22
  • @Alex I think the complexity of the other is "similar" (in that it's best for small dictionaries). I suspect performance will be ok with both; however, my suspicion is that this will be slightly better for larger arrays, as it requires only 3 passes. – Andy Hayden Oct 21 '17 at 23:38
  • Will this work if `a[0,0]` is `5`? Said another way, will it work if there isn't a 1:1 correspondence between the keys of `d` and unique values in `a`? – wwii Oct 22 '17 at 14:42
  • This is the best approach in my case: Since the dictionary remains constant and all possible array values are known in advance, the indexer needs to be created only once and can then be used to process a large number of arrays. If the indexer needs to be created only once, this method is roughly 6 times faster than the method proposed by @wwii. If the indexer needs to be created anew for every array to be processed, then it's not faster, I guess. – Alex Oct 22 '17 at 16:07
  • @wwii you have to pick a default value; above, it's -1. – Andy Hayden Oct 22 '17 at 17:35
  • @wwii note in your answer it leaves 'missing from dict' values as they are, which is also a choice (and may not be correct either). – Andy Hayden Oct 22 '17 at 17:38
  • You can replace `d.get(i, -1)` with `d.get(i, i)` if you want to retain the original array's values in cases where there is no corresponding value in the dictionary. – Steven Walton Oct 22 '17 at 17:58

This post solves the case where there is a one-to-one mapping between array values and dictionary keys, i.e. the dictionary covers all values in the array. The idea is similar to @Andy Hayden's smart solution, but we create a bigger lookup array that exploits Python's negative indexing. That gives us the efficiency of plain indexing, with no offsetting needed for the incoming input arrays, which should be the noticeable improvement here.

To build the indexer, which is a one-time cost since the dictionary stays the same, use this -

def getval_array(d):
    v = np.array(list(d.values()))
    k = np.array(list(d.keys()))
    maxv = k.max()
    minv = k.min()
    n = maxv - minv + 1
    val = np.empty(n, dtype=v.dtype)
    val[k] = v   # negative keys wrap around to the end of val, mirroring
                 # how negative values in the input will index at lookup time
    return val

val_arr = getval_array(d)

To get the final replacements, simply index. So, for an input array a, do -

out = val_arr[a]

Sample run -

In [8]: a = np.array([[  3,   0,   2,  -1],
   ...:               [  1, 255,   1, -16],
   ...:               [  0,   3,   2,   2]])
   ...: 
   ...: d = {0: 1, 1: 2, 2: 3, 3: 4, -1: 0, 255: 0, -16:5}
   ...: 

In [9]: val_arr = getval_array(d) # one-time operation

In [10]: val_arr[a]
Out[10]: 
array([[4, 1, 3, 0],
       [2, 0, 2, 5],
       [1, 4, 3, 3]])

Runtime test on tiled sample data -

In [141]: a = np.array([[  3,   0,   2,  -1],
     ...:               [  1, 255,   1, -16],
     ...:               [  0,   3,   2,   2]])
     ...: 
     ...: d = {0: 1, 1: 2, 2: 3, 3: 4, -1: 10, 255: 89, -16:5}
     ...: 

In [142]: a = np.random.choice(a.ravel(), 1024*2048).reshape(1024,2048)

# @Andy Hayden's soln
In [143]: indexer = np.array([d.get(i, -1) for i in range(a.min(), a.max() + 1)])

In [144]: %timeit indexer[(a - a.min())]
100 loops, best of 3: 8.34 ms per loop

# Proposed in this post
In [145]: val_arr = getval_array(d)

In [146]: %timeit val_arr[a]
100 loops, best of 3: 2.69 ms per loop
Divakar

Numpy can create vectorized functions for performing mapping operations on arrays. I'm not sure which method here is going to have the best performance, so I've timed my approach with timeit. I would recommend trying out a couple of the other approaches provided if you want to figure out which has the best performance.

# Function to be vectorized
def map_func(val, dictionary):
    return dictionary[val] if val in dictionary else val 

# Vectorize map_func
vfunc = np.vectorize(map_func)

# Run
print(vfunc(a, d))

You can time this by doing:

from timeit import Timer
t = Timer('vfunc(a, d)', 'from __main__ import a, d, vfunc')
print(t.timeit(number=1000))

My result for this approach was about 0.014 s.

Edit: For kicks, I tried this out on (1024, 2048) size numpy array of random numbers from -10 to 10, with your same dictionary. It took about a quarter of a second for a single array. Unless you are running a lot of these arrays, it may not really be worth optimizing if that's an acceptable level of performance.
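
As an aside, `np.vectorize` accepts an `excluded` argument that keeps the dictionary out of the broadcasting machinery; a minimal sketch (functionally equivalent to the above, not benchmarked here):

# Exclude the second positional argument (the dict) from vectorization.
vfunc = np.vectorize(map_func, excluded={1})
print(vfunc(a, d))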

Steven Walton
  • The documentation of `vectorize` says, "The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.", but I will try it out! – Alex Oct 22 '17 at 00:30
  • Yeah, after testing it out, Andy's approach with the indexer performed much better. I got 0.014 s with his approach, compared to 0.27 s with vectorize. The only adjustment was that, since my test array contained values not present in the dictionary, I changed `indexer = np.array([d.get(i, -1) for i in range(a.min(), a.max() + 1)])` to `indexer = np.array([d.get(i, i) for i in range(a.min(), a.max() + 1)])` to retain the original array's values for cases where there was no corresponding dictionary key. – Steven Walton Oct 22 '17 at 01:27

Another option, haven't benchmarked it:

import numpy as np
from typing import Dict

def replace_values(src: np.ndarray, new_by_old: Dict[int, int]) -> np.ndarray:
    dst = np.empty_like(src)
    for x in np.unique(src):
        dst[src == x] = new_by_old[x]   # KeyError if x has no mapping
    return dst

This is similar to https://stackoverflow.com/a/46868897/2135504, but should be a bit faster due to

  • using np.empty_like() instead of np.copy()
  • using np.unique(src) instead of new_by_old.keys()
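
For example, with a and d from the question (note that new_by_old must cover every unique value in src, otherwise a KeyError is raised):

a_new = replace_values(a, d)
print(a_new)
# [[4 1 3 0]
#  [2 0 2 3]
#  [1 4 3 3]]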
gebbissimo