
I have a 2 dimensional numpy array, and I would like each element to be rounded to the closest number in a sequence. The array has shape (28000, 24).

The sequence, for instance, would be [0, 0.05, 0.2, 0.33, 0.5].

E.g. an original 0.27 would be rounded to 0.33, and 0.42 would be rounded to 0.5.

This is what I use so far, but it is of course really slow with a double loop.

MWE:

import numpy as np

arr = np.array([[0.14, 0.18], [0.20, 0.27]])
new = []
sequence = np.array([0, 0.05, 0.2, 0.33, 0.5])
for i in range(len(arr)):
    row = []
    for j in range(len(arr[0])):
        temp = (arr[i][j] - sequence)**2
        row.append(list(sequence[np.where(temp == min(temp))])[0])
    new.append(row)

Result:

[[0.2, 0.2], [0.2, 0.33]]

Motivation:

In machine learning, I am making predictions. Since the outcomes reflect confidence votes by experts, it could be that 2 out of 3 gave a 1 (thus 0.66). So, in this data, relatively many values like 0, 0.1, 0.2, 0.33, 0.66, 0.75 etc. occur. My predictions, however, are something like 0.1724. I would remove a lot of prediction error by rounding, in this case to 0.2.

How to optimize rounding all elements?

Update: I now pre-allocate the result, so there doesn't have to be constant appending.

 # new = [[0]*len(arr[0]) for _ in range(len(arr))], then assigning into new[i][j]
 # instead of appending; note that [[0]*len(arr[0])] * len(arr) would alias the
 # same row object len(arr) times
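For reference, a runnable sketch of that pre-allocated version (the `argmin` call is equivalent to the `np.where(temp == min(temp))` construct above):

```python
import numpy as np

arr = np.array([[0.14, 0.18], [0.20, 0.27]])
sequence = np.array([0, 0.05, 0.2, 0.33, 0.5])

# Pre-allocate with a comprehension ([[0]*m]*n would alias one row n times)
new = [[0.0] * arr.shape[1] for _ in range(arr.shape[0])]
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        new[i][j] = float(sequence[np.argmin((arr[i, j] - sequence) ** 2)])

print(new)  # [[0.2, 0.2], [0.2, 0.33]]
```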

Timings:

Original problem: 36.62 seconds
Pre-allocated array: 15.52 seconds  
shx2 SOLUTION 1 (extra dimension): 0.47 seconds
shx2 SOLUTION 2 (better for big arrays): 4.39 seconds
Jaime's np.digitize: 0.02 seconds
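Timings like these can be reproduced with `timeit`; a minimal sketch (the helper names and the reduced array size are mine, not from the answers):

```python
import timeit

import numpy as np

arr = np.random.random((1000, 24))           # smaller than the (28000, 24) in the question
seq = np.array([0, 0.05, 0.2, 0.33, 0.5])

def digitize_round(arr, seq):
    # round via bin thresholds at the midpoints between sequence values
    thresholds = (seq[:-1] + seq[1:]) / 2
    return seq[np.digitize(arr.ravel(), thresholds).reshape(arr.shape)]

def broadcast_round(arr, seq):
    # round via an (N, M, K) distance array and argmin over the last axis
    return seq[np.abs(arr[..., np.newaxis] - seq).argmin(axis=-1)]

t1 = timeit.timeit(lambda: digitize_round(arr, seq), number=10)
t2 = timeit.timeit(lambda: broadcast_round(arr, seq), number=10)
print(f"digitize:  {t1:.4f}s  broadcast: {t2:.4f}s")
```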

3 Answers


Another truly vectorized solution with intermediate storage not larger than the array to be processed could be built around np.digitize.

>>> def round_to_sequence(arr, seq):
...     rnd_thresholds = np.add(seq[:-1], seq[1:]) / 2
...     arr = np.asarray(arr)
...     idx = np.digitize(arr.ravel(), rnd_thresholds).reshape(arr.shape)
...     return np.take(seq, idx)
... 
>>> round_to_sequence([[0.14, 0.18], [0.20, 0.27]],
...                   [0, 0.05, 0.2, 0.33, 0.5])
array([[ 0.2 ,  0.2 ],
       [ 0.2 ,  0.33]])

UPDATE So what's going on... The first line in the function figures out the midpoints between consecutive items in the sequence. These values are the thresholds for rounding: below a threshold you round down, above it you round up. I use np.add instead of the clearer seq[:-1] + seq[1:] so that the function accepts a list or tuple without needing to explicitly convert it to a numpy array first.

>>> seq = [0, 0.05, 0.2, 0.33, 0.5]
>>> rnd_threshold = np.add(seq[:-1], seq[1:]) / 2
>>> rnd_threshold
array([ 0.025,  0.125,  0.265,  0.415])

Next we use np.digitize to find out into which bin, as delimited by those threshold values, each item of the array falls. np.digitize only takes 1D arrays, so we have to do the .ravel plus .reshape dance to keep the original shape of the array. As written, it uses the standard convention that items exactly on a threshold are rounded up; you can reverse this behavior with the right keyword argument.
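A quick check of that tie-breaking behavior, using the thresholds computed above:

```python
import numpy as np

thresholds = np.array([0.025, 0.125, 0.265, 0.415])

# 0.125 lies exactly on a threshold
print(np.digitize([0.125], thresholds))              # [2] -> rounds up to 0.2
print(np.digitize([0.125], thresholds, right=True))  # [1] -> rounds down to 0.05
```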

>>> arr = np.array([[0.14, 0.18], [0.20, 0.27]])
>>> idx = np.digitize(arr.ravel(), rnd_threshold).reshape(arr.shape)
>>> idx
array([[2, 2],
       [2, 3]], dtype=int64)

Now all we need to do is create an array the shape of idx, using its entries to index the sequence of values to round to. This could be achieved with fancy indexing (seq[idx], after converting seq to an array), but it is often faster to use np.take.

>>> np.take(seq, idx)
array([[ 0.2 ,  0.2 ],
       [ 0.2 ,  0.33]])
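A practical side benefit of np.take here: it accepts seq as a plain Python list, whereas fancy indexing requires converting seq to an array first. A small check:

```python
import numpy as np

seq = [0, 0.05, 0.2, 0.33, 0.5]     # a plain Python list
idx = np.array([[2, 2], [2, 3]])

out = np.take(seq, idx)             # works directly: take converts seq internally
same = np.asarray(seq)[idx]         # fancy indexing needs the explicit conversion

print(out)
```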
Jaime
  • This is really great. If you provide explanation, I will mark this as the answer since it is roughly 20-25 times faster than the currently accepted answer. – PascalVKooten Nov 01 '13 at 15:57

Original Question

The original question stated that the OP wanted to round to the nearest 0.1, which has the following simple solution...

Really simple - let numpy do it for you:

import numpy as np

arr = np.array([[0.14, 0.18], [0.20, 0.27]])
np.around(arr, decimals=1)

When developing scientific software in Python, it is key to avoid loops if possible. If numpy has a procedure to do something, use it.
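To make that concrete: np.around and an explicit double loop compute the same values, but the vectorized call does all per-element work in C instead of Python (the array size below is reduced from the question's (28000, 24); the loop body uses the same rint(x*10)/10 arithmetic that around uses, so the results match exactly):

```python
import numpy as np

arr = np.random.random((200, 24))

rounded = np.around(arr, decimals=1)    # one vectorized call

# loop equivalent: same arithmetic, one element at a time, much slower
slow = np.empty_like(arr)
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        slow[i, j] = np.rint(arr[i, j] * 10) / 10
```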

Alex Chamberlain
  • I only used it as an example, I shouldn't have, I realise now. I am not interested in the decimal-only case. What if the list would contain `[0, 0.1, 0.33, 0.5, 0.2]`? – PascalVKooten Nov 01 '13 at 08:10
  • 1
    Awesome answer Alex! :) I was going to give a raw python solution, but this is awesome. – Games Brainiac Nov 01 '13 at 08:10
  • @Dualinity I think you can still remove the loops. Will the list ever be "large"? Or does the list have a better definition than just a list of numbers? – Alex Chamberlain Nov 01 '13 at 08:12
  • I surely hope so, but how in this case? My apologies for choosing an example that has such a solution, my intention is indeed to round to anything. – PascalVKooten Nov 01 '13 at 08:14
  • Alex, please remove "Original question"; I did mention it was just an **example**, and that I would like to round to numbers in a sequence! – PascalVKooten Nov 01 '13 at 08:22
  • @AlexChamberlain The list contains predictions. I know that there is a higher chance of these numbers being correct when they are rounded to decimals, or 0.33 or 0.66 for instance. – PascalVKooten Nov 01 '13 at 08:24
  • @Dualinity Please ask the simplest question on SO you can, but not simpler, otherwise you end up with problems like this. The problem originally stated - as you can now see - is a very different one to the problem now stated. shx2 has proposed the best solution I can think of. – Alex Chamberlain Nov 01 '13 at 08:32
  • @AlexChamberlain Yea, I should have been more clear, though I did state that it was just an example, and what I was really looking for was rounding to a sequence. – PascalVKooten Nov 01 '13 at 08:34

I would like to suggest two solutions to your problem. The first is a pure numpy solution, but if your original array is NxM and the sequence size is K, it uses an intermediate array of size NxMxK. So this solution is only good if that size is not gigantic in your case. Despite the big intermediate array it can still turn out to be very fast, since all the work is done in numpy space.

The second is a hybrid approach (which turns out to be much simpler to code, too), using np.vectorize as a decorator. It loops in numpy space, but calls back into Python for each element. The upside is that it avoids creating the huge intermediate array.

Both are valid solutions. You choose the one which works best with your array sizes.

Also, both work with arrays with any number of dimensions.

SOLUTION 1

import numpy as np

a = np.random.random((2,4))
a
=> 
array([[ 0.5501662 ,  0.13055979,  0.579619  ,  0.3161156 ],
       [ 0.07327783,  0.45156743,  0.38334009,  0.48772392]])

seq = np.array([ 0.1, 0.3, 0.6, 0.63 ])

# create 3-dim array of all the distances
all_dists = np.abs(a[..., np.newaxis] - seq)
all_dists.shape
=> (2, 4, 4)
all_dists
=>
array([[[ 0.4501662 ,  0.2501662 ,  0.0498338 ,  0.0798338 ],
        [ 0.03055979,  0.16944021,  0.46944021,  0.49944021],
        [ 0.479619  ,  0.279619  ,  0.020381  ,  0.050381  ],
        [ 0.2161156 ,  0.0161156 ,  0.2838844 ,  0.3138844 ]],

       [[ 0.02672217,  0.22672217,  0.52672217,  0.55672217],
        [ 0.35156743,  0.15156743,  0.14843257,  0.17843257],
        [ 0.28334009,  0.08334009,  0.21665991,  0.24665991],
        [ 0.38772392,  0.18772392,  0.11227608,  0.14227608]]])

# find where each element gets its closest, i.e. min dist
closest_idxs = all_dists.argmin(axis = -1)
closest_idxs
=> 
array([[2, 0, 2, 1],
       [0, 2, 1, 2]])

# choose
seq[closest_idxs]
=>
array([[ 0.6,  0.1,  0.6,  0.3],
       [ 0.1,  0.6,  0.3,  0.6]])
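Because the distances are computed along a new trailing axis, the same three steps handle input of any dimensionality unchanged; for example, a 3-D array (the shape here is just illustrative):

```python
import numpy as np

seq = np.array([0.1, 0.3, 0.6, 0.63])
a3 = np.random.random((2, 3, 4))                          # 3-D input

# same recipe: broadcasted distances, argmin over the last axis, then index
closest_idxs = np.abs(a3[..., np.newaxis] - seq).argmin(axis=-1)
rounded = seq[closest_idxs]

print(rounded.shape)   # (2, 3, 4) -- same shape as the input
```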

SOLUTION 2

@np.vectorize
def find_closest(x):
    dists = np.abs(x-seq)
    return seq[dists.argmin()]

find_closest(a)
=> 
array([[ 0.6,  0.1,  0.6,  0.3],
       [ 0.1,  0.6,  0.3,  0.6]])
shx2