I have a quite large 1D numpy array Xold with given values. These values shall be replaced according to a rule specified by a 2D numpy array Y. An example would be

Xold=np.array([0,1,2,3,4])
Y=np.array([[0,0],[1,100],[3,300],[4,400],[2,200]])

Whenever a value in Xold is identical to a value in Y[:,0], the new value in Xnew should be the corresponding value in Y[:,1]. This is accomplished by two nested for loops:

Xnew=np.zeros(len(Xold))
for i in range(len(Xold)):
    for j in range(len(Y)):
        if Xold[i]==Y[j,0]:
            Xnew[i]=Y[j,1]

With the given example, this yields Xnew=[0,100,200,300,400]. However, for large data sets this procedure is quite slow. What is a faster and more elegant way to accomplish this task?

Jann

8 Answers

SELECTING THE FASTEST METHOD

Answers to this question provided a nice assortment of ways to replace elements in a numpy array. Let's check which one is the quickest.

TL;DR: Numpy indexing is the winner

 def meth1():  # suggested by @Slam
     for old, new in Y:
         Xold[Xold == old] = new

 def meth2():  # suggested by myself; convert y_dict = dict(Y) first
     [y_dict[i] if i in y_dict.keys() else i for i in Xold]

 def meth3():  # suggested by @Eelco Hoogendoorn; import numpy_indexed as npi first
     npi.remap(Xold, keys=Y[:, 0], values=Y[:, 1])

 def meth4():  # suggested by @Brad Solomon; import pandas as pd first
     pd.Series(Xold).map(pd.Series(Y[:, 1], index=Y[:, 0])).values

 # suggested by @jdehesa; create Xnew = Xold.copy() and the index
 # idx = np.searchsorted(Xold, Y[:, 0]) first
 def meth5():
     Xnew[idx] = Y[:, 1]

Not-so-surprising results:

 In [39]: timeit.timeit(meth1, number=1000000)                                                                      
 Out[39]: 12.08

 In [40]: timeit.timeit(meth2, number=1000000)                                                                      
 Out[40]: 2.87

 In [38]: timeit.timeit(meth3, number=1000000)                                                                      
 Out[38]: 55.39

 In [12]: timeit.timeit(meth4, number=1000000)                                                                                      
 Out[12]: 256.84

 In [50]: timeit.timeit(meth5, number=1000000)                                                                                      
 Out[50]: 1.12

So, the good old list comprehension is the second fastest, and the winning approach is numpy indexing combined with searchsorted().
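For reproducibility, here is a minimal self-contained harness along these lines. The arrays and iteration count are assumptions (the original benchmark sizes were not stated), and only the two dependency-free methods are included:

```python
import timeit
import numpy as np

Xold = np.array([0, 1, 2, 3, 4])
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])

# Precomputed state, as in the method definitions above
y_dict = dict(Y)                       # for the list comprehension
idx = np.searchsorted(Xold, Y[:, 0])   # for numpy indexing
Xnew = Xold.copy()

def meth2():
    return [y_dict[i] if i in y_dict else i for i in Xold]

def meth5():
    Xnew[idx] = Y[:, 1]
    return Xnew

print(timeit.timeit(meth2, number=1000))
print(timeit.timeit(meth5, number=1000))
```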

Daniel Kislyuk

We can use np.searchsorted for a generic case when the data in first column of Y is not necessarily sorted -

sidx = Y[:,0].argsort()
out = Y[sidx[np.searchsorted(Y[:,0], Xold, sorter=sidx)],1]

Sample run -

In [53]: Xold
Out[53]: array([14, 10, 12, 13, 11])

In [54]: Y
Out[54]: 
array([[ 10,   0],
       [ 11, 100],
       [ 13, 300],
       [ 14, 400],
       [ 12, 200]])

In [55]: sidx = Y[:,0].argsort()
    ...: out = Y[sidx[np.searchsorted(Y[:,0], Xold, sorter=sidx)],1]

In [56]: out
Out[56]: array([400,   0, 200, 300, 100])

If not all elements have a corresponding mapping available, we need to do a bit more work, like so -

sidx = Y[:,0].argsort()
sorted_indx = np.searchsorted(Y[:,0], Xold, sorter=sidx)
sorted_indx[sorted_indx==len(sidx)] = len(sidx)-1
idx_out = sidx[sorted_indx]
out = Y[idx_out,1]
out[Y[idx_out,0]!=Xold] = 0 # NA values as 0s
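Putting the generic version together on a concrete run where Xold contains a key with no mapping in Y (the trailing 99 is an assumption added for illustration):

```python
import numpy as np

Xold = np.array([14, 10, 12, 13, 11, 99])   # 99 has no mapping in Y
Y = np.array([[10, 0], [11, 100], [13, 300], [14, 400], [12, 200]])

sidx = Y[:, 0].argsort()
sorted_indx = np.searchsorted(Y[:, 0], Xold, sorter=sidx)
# Clip positions that fell past the end of the key array
sorted_indx[sorted_indx == len(sidx)] = len(sidx) - 1
idx_out = sidx[sorted_indx]
out = Y[idx_out, 1]
out[Y[idx_out, 0] != Xold] = 0   # keys with no mapping become 0
print(out)  # [400   0 200 300 100   0]
```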
Divakar
  • This is good, although, assuming there exists the case where not all values have a mapping, I'm not sure if those should be set to 0/NA/... or left as they were in `Xold`. But I think that would just mean replacing the last `0` with `Xold[Y[idx_out,0]!=Xold]` so good solution in any case. – jdehesa Nov 05 '18 at 14:55
  • @jdehesa OP has `Xnew=np.zeros(len(Xold))` for the output. So, that made sense to me to use the same. – Divakar Nov 05 '18 at 14:56
  • this code didn't work for me: `In [16]: sidx = Y[:,0].argsort()` `In [17]: out = Y[sidx[np.searchsorted(Y[:,0], Xold, sorter=sidx)],1]` `IndexError: index 5 is out of bounds for axis 1 with size 5` – Daniel Kislyuk Nov 06 '18 at 12:20
  • As pointed out by @MihaiAlexandruIonut, this is because Xold in my example contains elements missing from Y. However, there was no initial limitation saying this can't be the case. – Daniel Kislyuk Nov 06 '18 at 12:45
  • @DanielKislyuk Hence, the generic solution at the end of my post. – Divakar Nov 06 '18 at 13:20

Here is one possibility:

import numpy as np

Xold = np.array([0, 1, 2, 3, 4])
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])
# Check every X value against every Y first value
m = Xold == Y[:, 0, np.newaxis]
# Check which elements in X are among Y first values
# (so values that are not in Y are not replaced)
m_X = np.any(m, axis=0)
# Compute replacement
# Xold * (1 - m_X) are the non-replaced values
# np.sum(Y[:, 1, np.newaxis] * m, axis=0) * m_X are the replaced values
Xnew = Xold * (1 - m_X) + np.sum(Y[:, 1, np.newaxis] * m, axis=0) * m_X
print(Xnew)

Output:

[  0 100 200 300 400]

This method works for more or less every case (unsorted arrays, multiple repetitions of values in X, values in X not replaced, values in Y not replacing anything in X), except if you give two replacements for the same value in Y, which would be wrong anyway. However, its time and space complexity is the product of the sizes of X and Y. If your problem has additional constraints (data is sorted, no repetitions, etc.) it might be possible to do something better. For example, if X is sorted with no repeated elements and every value in Y replaces a value in X (like in your example), this would probably be faster:

import numpy as np

Xold = np.array([0, 1, 2, 3, 4])
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])
idx = np.searchsorted(Xold, Y[:, 0])
Xnew = Xold.copy()
Xnew[idx] = Y[:, 1]
print(Xnew)
# [  0 100 200 300 400]
jdehesa

A first improvement you can make is to use numpy indexing, though you'll still have one loop:

for old, new in Y: 
    Xold[Xold == old] = new
Slam
  • This is incorrect. It returns `[200 200 200 300 400 400]` for my case `X = np.array([0,1,2,3,4,4]); Y = np.array([[0,1],[1,2],[3,300],[4,400],[2,200]])` – mathfux Aug 21 '20 at 11:32
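As the comment notes, in-place replacement can chain when old and new values overlap: a freshly written value gets matched again by a later (old, new) pair. Writing the replacements into a copy while matching against the original array (a small sketch, not part of the original answer) avoids this:

```python
import numpy as np

X = np.array([0, 1, 2, 3, 4, 4])
Y = np.array([[0, 1], [1, 2], [3, 300], [4, 400], [2, 200]])

# Match against the untouched X, write into a copy, so a value
# written by one pair is never re-matched by a later pair.
Xnew = X.copy()
for old, new in Y:
    Xnew[X == old] = new
print(Xnew)  # [  1   2 200 300 400 400]
```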

You can use slicing in combination with the argsort method.

Xnew = Y[Y[:,0].argsort()][:, 1][Xold]

Output

array([  0, 100, 200, 300, 400])
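Note that the final `[Xold]` indexes positionally, so this approach assumes the values in Xold are exactly the integers 0..len(Y)-1. A minimal sketch under that assumption:

```python
import numpy as np

Xold = np.array([0, 1, 2, 3, 4])
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])

# Sort the rows of Y by the key column so that row i holds the mapping
# for key i, then use Xold itself as a positional index into the values.
lookup = Y[Y[:, 0].argsort()][:, 1]
Xnew = lookup[Xold]
print(Xnew)  # [  0 100 200 300 400]
```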
Mihai Alexandru-Ionut

Solution with pd.Series.map()

If you're open to using the Pandas library, you can also do this in a vectorized way with .map():

>>> import pandas as pd
>>> pd.Series(Xold).map(pd.Series(Y[:, 1], index=Y[:, 0]))                                                                                                                                                                    
0      0
1    100
2    200
3    300
4    400
dtype: int64

>>> pd.Series(Xold).map(pd.Series(Y[:, 1], index=Y[:, 0])).values                                                                                                                                                            
array([  0, 100, 200, 300, 400])

For the signature a.map(b), a looks for its corresponding entries in the index of b, and maps to the respective values in b.

b here is pd.Series(Y[:, 1], index=Y[:, 0]), which uses the 0th column as the index and 1st column as the values that get mapped to.


Using pandas.core.algorithms directly

Under the hood, this will use .get_indexer() and the Cython-implemented take_1d():

indexer = mapper.index.get_indexer(values)
new_values = algorithms.take_1d(mapper._values, indexer)

Knowing that, if the arrays are really massive you could cut down some overhead like this:

from pandas.core import algorithms

indexer = pd.Index(Y[:, 0]).get_indexer(Xold)  
mapped = algorithms.take_1d(Y[:, 1], indexer)
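Since `pandas.core.algorithms` is internal and may change between versions, a variant of the same idea using only public API is to pair `pd.Index.get_indexer` with plain NumPy fancy indexing (this sketch assumes every value in Xold has a mapping; unmatched values would get indexer -1):

```python
import numpy as np
import pandas as pd

Xold = np.array([0, 1, 2, 3, 4])
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])

# For each element of Xold, find its row position in the key column ...
indexer = pd.Index(Y[:, 0]).get_indexer(Xold)
# ... and pull the corresponding replacement value from the value column
mapped = Y[:, 1][indexer]
print(mapped)  # [  0 100 200 300 400]
```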
Brad Solomon

The numpy_indexed package (disclaimer: I am its author) contains an efficient vectorized function that solves this general problem:

import numpy_indexed as npi
Xnew = npi.remap(Xold, keys=Y[:, 0], values=Y[:, 1])

This will work for any dtype, including when the keys and values to be replaced are themselves ndarrays, and you get a kwarg to specify how to react to missing elements.

Not sure how it compares to pandas performance-wise; but one of the design choices in this library is that performing elementary operations like this (or doing group-by etc) should not involve the creation of an entire new datatype like a Series or Table, which always bothered me about using pandas for this type of thing.

Eelco Hoogendoorn

You could convert Y to a dictionary with y = dict(Y) and then run the following list comprehension:

[y[i] if i in y.keys() else i for i in Xold]
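The same idea reads a bit tighter with dict.get, converting back to an array at the end (a minor variation on the comprehension above):

```python
import numpy as np

Xold = np.array([0, 1, 2, 3, 4])
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])

y = dict(Y)  # {0: 0, 1: 100, 3: 300, 4: 400, 2: 200}
# dict.get falls back to the original value for keys with no mapping
Xnew = np.array([y.get(i, i) for i in Xold])
print(Xnew)  # [  0 100 200 300 400]
```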
Daniel Kislyuk