
First of all, my apologies if this has been answered elsewhere. All I could find were questions about replacing elements of a given value, not elements of multiple values.

background

I have several thousand large np.arrays, like so:

# generate dummy data
input_array = np.zeros((100,100))
input_array[0:10,0:10] = 1
input_array[20:56, 21:43] = 5
input_array[34:43, 70:89] = 8

In those arrays, I want to replace values, based on a dictionary:

mapping = {1:2, 5:3, 8:6}

approach

At this time, I am using a simple loop, combined with fancy indexing:

output_array = np.zeros_like(input_array)

for key in mapping:
    output_array[input_array==key] = mapping[key]

problem

My arrays have dimensions of 2000 by 2000 and the dictionaries have around 1000 entries, so each array needs about 1000 full-array comparison passes and these loops take forever.

question

Is there a function that simply takes an array and a mapping in the form of a dictionary (or similar) and outputs the changed values?

Help is greatly appreciated!

Update:

Solutions:

I tested the individual solutions in IPython, using

%%timeit -r 10 -n 10

input data

import numpy as np
np.random.seed(123)

sources = range(100)
outs = list(range(100))
np.random.shuffle(outs)
mapping = {sources[a]: outs[a] for a in range(len(sources))}

For every solution:

np.random.seed(123)
input_array = np.random.randint(0,100, (1000,1000))

divakar, method 3:

%%timeit -r 10 -n 10
k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))

mapping_ar = np.zeros(k.max()+1,dtype=v.dtype) #k,v from approach #1
mapping_ar[k] = v
out = mapping_ar[input_array]

5.01 ms ± 641 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)

divakar, method 2:

%%timeit -r 10 -n 10
k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))

sidx = k.argsort() #k,v from approach #1

k = k[sidx]
v = v[sidx]

idx = np.searchsorted(k,input_array.ravel()).reshape(input_array.shape)
idx[idx==len(k)] = 0
mask = k[idx] == input_array
out = np.where(mask, v[idx], 0)

56.9 ms ± 609 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)

divakar, method 1:

%%timeit -r 10 -n 10

k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))

out = np.zeros_like(input_array)
for key,val in zip(k,v):
    out[input_array==key] = val

113 ms ± 6.2 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

eelco (with `import numpy_indexed as npi` run beforehand):

%%timeit -r 10 -n 10
output_array = npi.remap(input_array.flatten(), list(mapping.keys()), list(mapping.values())).reshape(input_array.shape)

143 ms ± 4.47 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

yatu:

%%timeit -r 10 -n 10

keys, choices = list(zip(*mapping.items()))
# [(1, 5, 8), (2, 3, 6)]
conds = np.array(keys)[:,None,None]  == input_array
np.select(conds, choices)

157 ms ± 5 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

original, loopy method:

%%timeit -r 10 -n 10
output_array = np.zeros_like(input_array)

for key in mapping:
    output_array[input_array==key] = mapping[key]

187 ms ± 6.44 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

Thanks for the superquick help!

    I think this is the same [question](https://stackoverflow.com/questions/3403973/fast-replacement-of-values-in-a-numpy-array). Best answer possibly this [one](https://stackoverflow.com/a/43917704/6091318) – Brenlla May 02 '19 at 09:53
  • As noted below, the first call to list was a mistake; it should be a lot faster without it, I think – Eelco Hoogendoorn May 02 '19 at 12:23
  • Note these solutions assume all the source values are greater than or equal to 0. Also, it assumes the sources are not too sparse, so an array with the length of the max value can fit into memory. – Louis Yang Apr 12 '23 at 01:07

4 Answers


Approach #1: Loopy one with array data

One approach would be to extract the keys and values into arrays and then use a similar loop -

k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))

out = np.zeros_like(input_array)
for key,val in zip(k,v):
    out[input_array==key] = val

The benefit of this one over the original is the spatial locality of the array data, which allows efficient data fetching in the iterations.

Also, since you mentioned several thousand large np.arrays: if the mapping dictionary stays the same, the step to get the array versions k and v would be a one-time setup process.
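
A minimal sketch of that one-time setup, wrapped in a helper (the make_replacer name is hypothetical) so the dict-to-array conversion is paid only once across all arrays:

import numpy as np

def make_replacer(mapping):
    # one-time conversion of the dict into key/value arrays
    k = np.array(list(mapping.keys()))
    v = np.array(list(mapping.values()))
    def replace(input_array):
        out = np.zeros_like(input_array)
        for key, val in zip(k, v):
            out[input_array == key] = val
        return out
    return replace

replace = make_replacer({1: 2, 5: 3, 8: 6})
# out = replace(input_array)  # reuse on each of the thousands of arrays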

Approach #2: Vectorized one with searchsorted

A vectorized one could be suggested using np.searchsorted -

sidx = k.argsort() #k,v from approach #1

# sort keys and values together so searchsorted can binary-search the keys
k = k[sidx]
v = v[sidx]

# for each input element, find where it would sit among the sorted keys
idx = np.searchsorted(k,input_array.ravel()).reshape(input_array.shape)
idx[idx==len(k)] = 0            # clip out-of-range positions to a valid index
mask = k[idx] == input_array    # True only where an exact key match exists
out = np.where(mask, v[idx], 0) # remap matches; non-matches become 0
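
To see what np.searchsorted contributes here, a tiny worked example (the values are illustrative assumptions, not from the question):

k = np.array([1, 5, 8])          # sorted keys
v = np.array([2, 3, 6])
x = np.array([5, 8, 1, 4])       # 4 has no key in the mapping
idx = np.searchsorted(k, x)      # -> [1, 2, 0, 1]
mask = k[idx] == x               # -> [ True,  True,  True, False]
out = np.where(mask, v[idx], 0)  # -> [3, 6, 2, 0]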

Approach #3: Vectorized one with mapping-array for integer keys

A vectorized one could be suggested using a mapping array for integer keys, which when indexed by the input array would lead us directly to the final output -

# build a lookup table where index = old value and entry = new value
mapping_ar = np.zeros(k.max()+1,dtype=v.dtype) #k,v from approach #1
mapping_ar[k] = v
out = mapping_ar[input_array]  # a single fancy-indexing pass remaps everything
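
As a quick illustration with the small mapping from the question (a sketch; the lookup table's index is the old value, its entry the new value):

mapping = {1: 2, 5: 3, 8: 6}
k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))

mapping_ar = np.zeros(k.max()+1, dtype=v.dtype)
mapping_ar[k] = v
print(mapping_ar)                          # [0 2 0 0 0 3 0 0 6]
print(mapping_ar[np.array([1, 5, 8, 0])])  # [2 3 6 0]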
Divakar
  • Approach #3 assumes that `input_array` is an array of non-negative integers, and that `k` contains all values of `input_arr`. The second issue can be fixed by replacing `mapping_ar = np.zeros(k.max()+1,dtype=v.dtype)` with `mapping_ar = np.arange(input_arr.max()+1)`, but this will not be efficient if `input_arr` has large values. – bb1 Sep 10 '21 at 00:02
  • In approach #2 the last line should be replaced by `out = np.where(mask, v[idx], input_array)`. – bb1 Sep 10 '21 at 00:53
  • `mask = k[idx] == maxes` raises `TypeError: only integer scalar arrays can be converted to a scalar index` (maxes is from nanargmax) – Flash Thunder Feb 17 '23 at 09:33

I think the Divakar #3 method assumes that the mapping dict covers all values (or at least the maximum value) in the target array. Otherwise, to avoid index-out-of-range errors, you have to replace the line

mapping_ar = np.zeros(k.max()+1,dtype=v.dtype)

with

mapping_ar = np.zeros(input_array.max()+1,dtype=v.dtype)

That adds considerable overhead.
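
A sketch of one way to address both issues at once, assuming non-negative integer inputs: initialize the table as an identity map, so values without a key pass through unchanged (this mirrors the np.arange fix suggested in the comments above):

import numpy as np

# identity-initialized lookup table: index i initially maps to i
n = max(int(k.max()), int(input_array.max())) + 1
mapping_ar = np.arange(n)
mapping_ar[k] = v                # overwrite only the keys we actually remap
out = mapping_ar[input_array]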

Trenton

The numpy_indexed library (disclaimer: I am its author) provides functionality to implement this operation in an efficient vectorized manner:

import numpy_indexed as npi
output_array = npi.remap(input_array.flatten(), list(mapping.keys()), list(mapping.values())).reshape(input_array.shape)

Note: I didn't test it, but it should work along these lines. Efficiency should be good for large inputs with many items in the mapping; I imagine similar to Divakar's method 2, though not as fast as his method 3. But this solution is aimed more at generality: it will also work for inputs which are not positive integers, or even nd-arrays (e.g. replacing colors in an image with other colors).
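
For instance, a small sketch with negative, non-contiguous keys, which a plain integer lookup table could not index directly (the values are illustrative assumptions):

import numpy as np
import numpy_indexed as npi

arr = np.array([[-7, 3], [3, -7]])
out = npi.remap(arr.flatten(), [-7, 3], [1, 2]).reshape(arr.shape)
# expected: [[1, 2], [2, 1]]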

Eelco Hoogendoorn
  • Thanks! I had to slightly adapt your code for Python 3: `mapping.values()` to `list(mapping.values())` – warped May 02 '19 at 12:02
  • oops; I put the list around the input instead of the values. Indeed you need the latter, not the former; it will slow things down a lot for no good reason. Updated my answer – Eelco Hoogendoorn May 02 '19 at 12:22
  • right, my bad. updated the post with your edit. 240 ms performance increase :) – warped May 02 '19 at 12:25
  • Interesting that it is still slower than divakar method 1; are you benchmarking with a mapping with 1000 entries, or a simpler problem like the 3-entry mapping of your example? – Eelco Hoogendoorn May 02 '19 at 12:28
  • test conditions are under the headers solutions and input data, respectively. For simplicity, I use the same 1000 by 1000 array, in 10 runs, with 10 loops each – warped May 02 '19 at 12:33
  • Oh, you did mention it indeed; my bad; 100 entries. Wow, would not have expected a direct loop to be faster in that scenario... should go back to that code and profile it sometime, see if there are any obvious bottlenecks. Still, going to 1000 entries should see the naive methods fall behind – Eelco Hoogendoorn May 02 '19 at 12:33

Given that you're using numpy arrays, I'd suggest you do the mapping using numpy too. Here's a vectorized approach using np.select:

mapping = {1:2, 5:3, 8:6}
keys, choices = list(zip(*mapping.items()))
# [(1, 5, 8), (2, 3, 6)]
# we can use broadcasting to obtain a 3x100x100
# array to use as condlist
conds = np.array(keys)[:,None,None]  == input_array
# use conds as arrays of conditions and the values 
# as choices
np.select(conds, choices)

array([[2, 2, 2, ..., 0, 0, 0],
       [2, 2, 2, ..., 0, 0, 0],
       [2, 2, 2, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
yatu