68

I have a very large numpy array (containing up to a million elements) like the one below:

[0,1,6,5,1,2,7,6,2,3,8,7,3,4,9,8,5,6,11,10,6,7,12,11,7,
8,13,12,8,9,14,13,10,11,16,15,11,12,17,16,12,13,18,17,13,
14,19,18,15,16,21,20,16,17,22,21,17,18,23,22,18,19,24,23]

and a small dictionary map for replacing some of the elements in the above array

{4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

I would like to replace some of the elements according to the map above. The numpy array is really large, and only a small subset of the elements (occurring as keys in the dictionary) will be replaced with the corresponding values. What is the fastest way to do this?

Thomas Weller
  • 55,411
  • 20
  • 125
  • 222
D R
  • 21,936
  • 38
  • 112
  • 149
  • 1
    Related question: https://stackoverflow.com/questions/16992713/translate-every-element-in-numpy-array-according-to-key – AMC Feb 06 '20 at 00:35
  • Related question, with the best solution I found on SO: https://stackoverflow.com/questions/55949809/efficiently-replace-elements-in-array-based-on-dictionary-numpy-python – toliveira Aug 28 '20 at 19:19

10 Answers10

49

I believe there's even more efficient method, but for now, try

from numpy import copy

newArray = copy(theArray)
for k, v in d.iteritems(): newArray[theArray==k] = v

Microbenchmark and test for correctness:

#!/usr/bin/env python2.7

from numpy import copy, random, arange

random.seed(0)
data = random.randint(30, size=10**5)

d = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}
dk = d.keys()
dv = d.values()

def f1(a, d):
    b = copy(a)
    for k, v in d.iteritems():
        b[a==k] = v
    return b

def f2(a, d):
    for i in xrange(len(a)):
        a[i] = d.get(a[i], a[i])
    return a

def f3(a, dk, dv):
    mp = arange(0, max(a)+1)
    mp[dk] = dv
    return mp[a]


a = copy(data)
res = f2(a, d)

assert (f1(data, d) == res).all()
assert (f3(data, dk, dv) == res).all()

Result:

$ python2.7 -m timeit -s 'from w import f1,f3,data,d,dk,dv' 'f1(data,d)'
100 loops, best of 3: 6.15 msec per loop

$ python2.7 -m timeit -s 'from w import f1,f3,data,d,dk,dv' 'f3(data,dk,dv)'
100 loops, best of 3: 19.6 msec per loop
kennytm
  • 510,854
  • 105
  • 1,084
  • 1,005
  • Iterating like `for k in d` would make this as fast as possible` – jamylak Jun 07 '13 at 21:17
  • A vote against `numpy.place` as mentioned by @katrielalex , as it just wasted around twenty to thirty hours of my time by being buggy; apparently its use is relatedly discouraged. "I would generally suggest to use `np.copyto` or (in this case) boolean fancy indexing to achieve the same and avoid `np.place` or `np.putmask`. I realize that in some cases those functions are not quite 1:1 replaces by these." FWIW I didn't have this bug, but another one where it was silently just failing to work. – ijoseph Nov 28 '16 at 04:46
  • For anyone wondering, the command is now `for k, v in d.items(): newArray[theArray==k] = v` – bela83 May 23 '17 at 09:01
  • Note: If there's three times more items in the dict, it's going to be three times as slow (I don't think the f3 version will be hit as much). – Andy Hayden Oct 21 '17 at 23:45
28

Assuming the values are between 0 and some maximum integer, one could implement a fast replace by using the numpy-array as int->int dict, like below

mp = numpy.arange(0,max(data)+1)
mp[replace.keys()] = replace.values()
data = mp[data]

where first

data = [ 0  1  6  5  1  2  7  6  2  3  8  7  3  4  9  8  5  6 11 10  6  7 12 11  7
  8 13 12  8  9 14 13 10 11 16 15 11 12 17 16 12 13 18 17 13 14 19 18 15 16
 21 20 16 17 22 21 17 18 23 22 18 19 24 23]

and replacing with

replace = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

we obtain

data = [ 0  1  6  5  1  2  7  6  2  3  8  7  3  0  5  8  5  6 11 10  6  7 12 11  7
  8 13 12  8  5 10 13 10 11 16 15 11 12 17 16 12 13 18 17 13 10 15 18 15 16
  1  0 16 17  2  1 17 18  3  2 18 15  0  3]
D R
  • 21,936
  • 38
  • 112
  • 149
  • Also note the `digitize` function, shown in the accepted answer to this question: http://stackoverflow.com/questions/13572448/change-values-in-a-numpy-array – Stefan van der Walt Jan 16 '15 at 17:20
14

I benchmarked some solutions, and the result is without appeal :

import timeit
import numpy as np

array = 2 * np.round(np.random.uniform(0,10000,300000)).astype(int)
from_values = np.unique(array) # pair values from 0 to 2000
to_values = np.arange(from_values.size) # all values from 0 to 1000
d = dict(zip(from_values, to_values))

def method_for_loop():
    out = array.copy()
    for from_value, to_value in zip(from_values, to_values) :
        out[out == from_value] = to_value
    print('Check method_for_loop :', np.all(out == array/2)) # Just checking
print('Time method_for_loop :', timeit.timeit(method_for_loop, number = 1))

def method_list_comprehension():
    out = [d[i] for i in array]
    print('Check method_list_comprehension :', np.all(out == array/2)) # Just checking
print('Time method_list_comprehension :', timeit.timeit(method_list_comprehension, number = 1))

def method_bruteforce():
    idx = np.nonzero(from_values == array[:,None])[1]
    out = to_values[idx]
    print('Check method_bruteforce :', np.all(out == array/2)) # Just checking
print('Time method_bruteforce :', timeit.timeit(method_bruteforce, number = 1))

def method_searchsort():
    sort_idx = np.argsort(from_values)
    idx = np.searchsorted(from_values,array,sorter = sort_idx)
    out = to_values[sort_idx][idx]
    print('Check method_searchsort :', np.all(out == array/2)) # Just checking
print('Time method_searchsort :', timeit.timeit(method_searchsort, number = 1))

And I got the following results :

Check method_for_loop : True
Time method_for_loop : 2.6411612760275602

Check method_list_comprehension : True
Time method_list_comprehension : 0.07994363596662879

Check method_bruteforce : True
Time method_bruteforce : 11.960559037979692

Check method_searchsort : True
Time method_searchsort : 0.03770717792212963

The "searchsort" method is almost a hundred times faster than the "for" loop, and about 3600 times faster than the numpy bruteforce method. The list comprehension method is also a very good trade-off between code simplicity and speed.

Dave Liu
  • 906
  • 1
  • 11
  • 31
Jean Lescut
  • 1,387
  • 1
  • 12
  • 17
10

Another more general way to achieve this is function vectorization:

import numpy as np

data = np.array([0, 1, 6, 5, 1, 2, 7, 6, 2, 3, 8, 7, 3, 4, 9, 8, 5, 6, 11, 10, 6, 7, 12, 11, 7, 8, 13, 12, 8, 9, 14, 13, 10, 11, 16, 15, 11, 12, 17, 16, 12, 13, 18, 17, 13, 14, 19, 18, 15, 16, 21, 20, 16, 17, 22, 21, 17, 18, 23, 22, 18, 19, 24, 23])
mapper_dict = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

def mp(entry):
    return mapper_dict[entry] if entry in mapper_dict else entry
mp = np.vectorize(mp)

print mp(data)
5

The numpy_indexed package (disclaimer: I am its author) provides an elegant and efficient vectorized solution to this type of problem:

import numpy_indexed as npi
remapped_array = npi.remap(theArray, list(dict.keys()), list(dict.values()))

The method implemented is similar to the searchsorted based approach mentioned by Jean Lescut, but even more general. For instance, the items of the array do not need to be ints, but can be any type, even nd-subarrays themselves; yet it should achieve the same kind of performance.

Eelco Hoogendoorn
  • 10,459
  • 1
  • 44
  • 42
  • could you modify this one-liner to achieve replacing elements of the original list that do not constitute dictionary keys with something else, say for example a constant ? (as opposed to leaving the original value) – Tony Jun 19 '17 at 01:34
  • Not with the current verrsion of the package, but the remap function has a 'missing' kwarg, and if you open the source of the function, you will see that adding this kind of behavior would indeed be easy. Ill do that for an upcoming release; until then feel free to copy-paste the source if you wish. – Eelco Hoogendoorn Jun 19 '17 at 06:19
4

No solution was posted still without a python loop on the array (except Celil's one, which however assume numbers are "small"), so here is an alternative:

def replace(arr, rep_dict):
    """Assumes all elements of "arr" are keys of rep_dict"""

    # Removing the explicit "list" breaks python3
    rep_keys, rep_vals = array(list(zip(*sorted(rep_dict.items()))))

    idces = digitize(arr, rep_keys, right=True)
    # Notice rep_keys[digitize(arr, rep_keys, right=True)] == arr

    return rep_vals[idces]

the way "idces" is created comes from here.

Community
  • 1
  • 1
Pietro Battiston
  • 7,930
  • 3
  • 42
  • 45
4

A fully vectorized solution using np.in1d and np.searchsorted:

replace = numpy.array([list(replace.keys()), list(replace.values())])    # Create 2D replacement matrix
mask = numpy.in1d(data, replace[0, :])                                   # Find elements that need replacement
data[mask] = replace[1, numpy.searchsorted(replace[0, :], data[mask])]   # Replace elements
Nils Werner
  • 34,832
  • 7
  • 76
  • 98
0

Well, you need to make one pass through theArray, and for each element replace it if it is in the dictionary.

for i in xrange( len( theArray ) ):
    if foo[ i ] in dict:
        foo[ i ] = dict[ foo[ i ] ]
Katriel
  • 120,462
  • 19
  • 136
  • 170
0
for i in xrange(len(the_array)):
    the_array[i] = the_dict.get(the_array[i], the_array[i])
John La Rooy
  • 295,403
  • 53
  • 369
  • 502
0

Pythonic way without the need for data to be integer, can be even strings:

from scipy.stats import rankdata
import numpy as np

data = np.random.rand(100000)
replace = {data[0]: 1, data[5]: 8, data[8]: 10}

arr = np.vstack((replace.keys(), replace.values())).transpose()
arr = arr[arr[:,1].argsort()]

unique = np.unique(data)
mp = np.vstack((unique, unique)).transpose()
mp[np.in1d(mp[:,0], arr),1] = arr[:,1]
data = mp[rankdata(data, 'dense')-1][:,1]
caiohamamura
  • 2,260
  • 21
  • 23