
Given the following array:

a = np.array([[1,2,3],[4,5,6],[7,8,9]])

[[1 2 3]
 [4 5 6]
 [7 8 9]]

How can I replace certain values with other values?

bad_vals = [4, 2, 6]
update_vals = [11, 1, 8]

I currently use:

for idx, v in enumerate(bad_vals):
    a[a==v] = update_vals[idx]

Which gives:

[[ 1  1  3]
 [11  5  8]
 [ 7  8  9]]

But it is rather slow for large arrays with many values to be replaced. Is there any good alternative?

The input array can be changed to anything (list of list/tuples) if this might be necessary to access certain speedy black magic.

EDIT:

Based on the great answers from @Divakar and @charlysotelo, I did a quick comparison on my real use-case data using the benchit package. My input array has a rows:columns ratio of roughly 100:1, and the number of replacement values is on the order of 3 x the number of rows.

Functions:

import numpy as np

# current approach
def enumerate_values(a, bad_vals, update_vals):
    for idx, v in enumerate(bad_vals):
        a[a==v] = update_vals[idx]
    return a

# provided solution @Divakar
def map_values(a, bad_vals, update_vals):
    N = max(a.max(), max(bad_vals))+1
    mapar = np.empty(N, dtype=int)
    mapar[a] = a
    mapar[bad_vals] = update_vals
    out = mapar[a]
    return out

# provided solution @charlysotelo
def vectorize_values(a, bad_vals, update_vals):
    bad_to_good_map = {}
    for idx, bad_val in enumerate(bad_vals):
        bad_to_good_map[bad_val] = update_vals[idx]
    f = np.vectorize(lambda x: (bad_to_good_map[x] if x in bad_to_good_map else x))
    a = f(a)

    return a

# define benchit input functions
import benchit
funcs = [enumerate_values, map_values, vectorize_values]

# define benchit input variables to bench against
in_ = {
    n: (
        np.random.randint(0,n*10,(n,int(n * 0.01))), # array
        np.random.choice(n*10, n*3,replace=False), # bad_vals
        np.random.choice(n*10, n*3) # update_vals
    ) 
    for n in [300, 1000, 3000, 10000, 30000]
}

# do the bench
# note: timing the slow approaches (my own function here) takes a while
t = benchit.timings(funcs, in_, multivar=True, input_name='Len')
t.plot(logx=True, grid=False)

[Image: benchit timings plot]
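One caveat when comparing these functions: they are not guaranteed to return the same result. The loop applies the replacements one after another, so a value that has just been written can be replaced again by a later pair, while a dictionary or table lookup maps every element exactly once. The sketch below illustrates this on a tiny input; the names `loop_replace` and `single_pass_replace` are made up for this illustration and simply mirror the two styles above.

```python
import numpy as np

def loop_replace(a, bad_vals, update_vals):
    # Sequential replacement on a copy, as in enumerate_values
    out = a.copy()
    for bad, new in zip(bad_vals, update_vals):
        out[out == bad] = new
    return out

def single_pass_replace(a, bad_vals, update_vals):
    # Each element is looked up once, as in vectorize_values
    mapping = dict(zip(bad_vals, update_vals))
    return np.vectorize(lambda x: mapping.get(x, x))(a)

a = np.array([1, 2, 3])
bad_vals = [1, 2]
update_vals = [2, 5]   # the first update value is itself a "bad" value

looped = loop_replace(a, bad_vals, update_vals)
single = single_pass_replace(a, bad_vals, update_vals)
print(looped)  # [5 5 3]: 1 -> 2, then that 2 -> 5
print(single)  # [2 5 3]: 1 -> 2 only
```

With the sample data from the question no update value is also a bad value, so both styles agree there. The random benchmark inputs draw both from the same range, so the outputs can differ even though the timings remain comparable.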

Mattijn
  • Are the values (positive) integers? If so, can we build a list like `[0,1,1,3,11,5,8]` that defines the mapping? – Willem Van Onsem Jun 03 '20 at 21:52
  • You could use answers from [Fast replacement of values in a numpy array](https://stackoverflow.com/questions/3403973/fast-replacement-of-values-in-a-numpy-array) by making a dictionary from bad_vals and update_vals. – DarrylG Jun 03 '20 at 21:57
  • @WillemVanOnsem Yes, all values are positive integers – Mattijn Jun 03 '20 at 21:58
  • @Divakar, yes will give! Had to sleep a bit.. – Mattijn Jun 04 '20 at 07:17

2 Answers


Here's one way based on the hinted mapping array method for positive numbers -

def map_values(a, bad_vals, update_vals):
    N = max(a.max(), max(bad_vals))+1
    mapar = np.empty(N, dtype=int)
    mapar[a] = a
    mapar[bad_vals] = update_vals
    out = mapar[a]
    return out

Sample run -

In [94]: a
Out[94]: 
array([[1, 2, 1],
       [4, 5, 6],
       [7, 1, 1]])

In [95]: bad_vals
Out[95]: [4, 2, 6]

In [96]: update_vals
Out[96]: [11, 1, 8]

In [97]: map_values(a, bad_vals, update_vals)
Out[97]: 
array([[ 1,  1,  1],
       [11,  5,  8],
       [ 7,  1,  1]])

Benchmarking

# Original soln
def replacevals(a, bad_vals, update_vals):
    out = a.copy()
    for idx, v in enumerate(bad_vals):
        out[out==v] = update_vals[idx]
    return out

The given sample had a 2D input of shape n x n with n values to be replaced. Let's set up input datasets with the same structure.

Using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions.

import numpy as np
import benchit
funcs = [replacevals, map_values]
# n -> (n x n array, n unique bad_vals, n update_vals)
in_ = {n: (np.random.randint(0, n*10, (n, n)),
           np.random.choice(n*10, n, replace=False),
           np.random.choice(n*10, n))
       for n in [3, 10, 100, 1000, 2000]}
t = benchit.timings(funcs, in_, multivar=True, input_name='Len')
t.plot(logx=True, save='timings.png')

Plot:

[Image: timings plot]

Divakar
  • This is a really nice solution. It is 740X faster than my solution for my real use case. Thanks for sharing this. Also nice `benchit` package. Let me try to see if I can combine the other solution (which was 55X faster than my approach) in a chart and update my question with this. Thanks again! – Mattijn Jun 04 '20 at 07:50
  • @Mattijn Yeah you can just add any other approach into `funcs = [replacevals, map_values]` with the function name(s). Should be convenient that way. Would like to see your chart(s), if you would like to share. – Divakar Jun 04 '20 at 07:52
  • @Divakar--Benchit looks interesting. How does benchit compare to [Perfplot](https://pypi.org/project/perfplot/) which I have used? Any advantages/disadvantages? – DarrylG Jun 04 '20 at 09:35
  • @DarrylG Good to see your interest! Well, I had used `perfplot`, but I wanted something different. The core ideas while developing `benchit` were: 1) Minimal steps/code to get to the final plot. 2) A platform where we can do more. So, getting speedups, etc. and also being able to manipulate the benchmarked results was the idea. You should have a look here if you are interested: https://benchit.readthedocs.io/en/latest/workflow.html. It lacks an assert for equality though, which seems to be useful sometimes from my limited exposure to `perfplot`. – Divakar Jun 04 '20 at 09:43
  • @Divakar--OK, will give it a try for my next benchmark. Two advantages I see benchit has are: 1) it shows the test environment information on the top left of the screen, 2) it has a nicer grid (horizontal & vertical) to display the results. – DarrylG Jun 04 '20 at 09:55
  • @Divakar--wanted to give benchit a try, but I get an error with the "Minimal Workflow" example. Error: `Traceback (most recent call last): File "main.py", line 9, in t = benchit.timings(funcs, inputs) AttributeError: module 'benchit' has no attribute 'timings'`. My code is pretty much the workflow example. – DarrylG Jul 04 '20 at 01:41
  • @DarrylG That looks odd. Did you do `pip install benchit` for the installation? Can you try out on a new session, just to make sure that the `benchit` used is actually from the said module? – Divakar Jul 04 '20 at 05:19
  • @DarrylG That most probably would be because you have a dir named "benchit" in the current working dir and that's picked up as the module. So, suggestion would be to install with `pip install benchit`, if you haven't done so and then use a clean dir (i.e. make sure there's no dir named benchit) and test there. – Divakar Jul 04 '20 at 07:04
  • @Divakar--trying it using [code in this online repl](https://repl.it/@DarrylGurganiou/EvenModestApplicationserver). It automatically installs packages which don't exist based upon the import statement. I don't see a benchit subfolder in the current working directory. – DarrylG Jul 04 '20 at 08:48
  • @DarrylG That was importing some other package named "bench-it", not sure why. Also, it needs qt5 matplotlib backend. So, these online interpreters won't work at the moment. Can you try out on a local workstation? – Divakar Jul 04 '20 at 10:24
  • @Divakar--nice catch on the cause. Seems benchit depends upon pandas as a backend, which currently is not working on my local machine (Jupyter notebook) due to some library incompatibility issue. Been meaning to rebuild a new Python environment but have stalled on that since found online repl okay for Pandas needs. This gives me another reason to rebuild the environment so Pandas works. – DarrylG Jul 04 '20 at 10:44
  • @Divakar--was able to get it to load benchit rather than bench-it by specifying benchit in the Python spec file (i.e. pyproject.toml). Unfortunately, the `import benchit` line now has trouble with matplotlib so I'll abandon for today. FYI--last part of error: `from matplotlib.backends.qt_compat import QtGui File "/opt/virtualenvs/python3/lib/python3.8/site-packages/matplotlib/backends/qt_compat.py", line 168, in raise ImportError("Failed to import any qt binding") ImportError: Failed to import any qt binding` – DarrylG Jul 04 '20 at 15:02
  • @DarrylG Please check out with the new release of benchit. – Divakar Jul 05 '20 at 08:40
  • @Divakar--Thanks! I was able to run your basic test on the online Python with the new release. Adding `bench = "^0.0.3"` to the Python spec file is needed for it to load benchit and its dependencies, although it still loads bench-it also. – DarrylG Jul 05 '20 at 09:20
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/217240/discussion-between-darrylg-and-divakar). – DarrylG Jul 05 '20 at 10:02

This really depends on the size of your array, and the size of your mappings from bad to good integers.

For a larger number of bad to good integers - the method below is better:

import numpy as np
import time

ARRAY_ROWS = 10000
ARRAY_COLS = 1000

NUM_MAPPINGS = 10000

bad_vals = np.random.rand(NUM_MAPPINGS)
update_vals = np.random.rand(NUM_MAPPINGS)

bad_to_good_map = {}
for idx, bad_val in enumerate(bad_vals):
    bad_to_good_map[bad_val] = update_vals[idx]

# np.vectorize with mapping
# Takes about 4 seconds
a = np.random.rand(ARRAY_ROWS, ARRAY_COLS)
f = np.vectorize(lambda x: (bad_to_good_map[x] if x in bad_to_good_map else x))
start = time.time()
a = f(a)
print(time.time() - start)  # elapsed seconds


# Your way
# Takes about 60 seconds
a = np.random.rand(ARRAY_ROWS, ARRAY_COLS)
start = time.time()
for idx, v in enumerate(bad_vals):
    a[a==v] = update_vals[idx]
print(time.time() - start)  # elapsed seconds

Running the code above, the np.vectorize(lambda) approach finished in under 4 seconds, whereas your approach took almost 60 seconds. However, with NUM_MAPPINGS set to 100, your method takes less than a second for me, which is faster than the 2 seconds for the np.vectorize approach.
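The benchmark above uses float values, for which an integer lookup table does not apply. As a possible alternative that stays inside NumPy, the sketch below sorts the bad values once and locates each array element with `np.searchsorted`. This is a different technique from the `np.vectorize` approach timed above, it has not been benchmarked here, and the helper name `replace_sorted` is hypothetical. It assumes `bad_vals` is non-empty and contains no duplicates.

```python
import numpy as np

def replace_sorted(a, bad_vals, update_vals):
    # Hypothetical helper; assumes bad_vals is non-empty with no duplicates
    a = np.asarray(a)
    bad_vals = np.asarray(bad_vals)
    update_vals = np.asarray(update_vals)
    order = np.argsort(bad_vals)
    sorted_bad = bad_vals[order]
    sorted_new = update_vals[order]
    # Index at which each element of a would be inserted into sorted_bad
    pos = np.searchsorted(sorted_bad, a)
    # Elements larger than every bad value get an out-of-range index; clamp it
    pos = np.minimum(pos, len(sorted_bad) - 1)
    # Replace only where the element at that index is an exact match
    hit = sorted_bad[pos] == a
    return np.where(hit, sorted_new[pos], a)

a = np.array([[0.5, 1.5], [2.5, 3.5]])
out = replace_sorted(a, [2.5, 0.5], [9.0, 7.0])
print(out)
# [[7.  1.5]
#  [9.  3.5]]
```

Like the dictionary lookup, this maps each element exactly once and relies on exact equality, so it only replaces floats that match a bad value bit for bit.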

  • Thanks a lot for sharing your solution, which provided a 55X speedup compared to my solution on my real data. While amazing, the solution provided by @Divakar had a speedup of 741X. Thanks again! – Mattijn Jun 04 '20 at 07:52