
Question

Suppose we are given a numpy array arr of doubles and a small positive integer n. I am looking for an efficient way to set the n least significant bits of the mantissa of each element of arr to 0 or to 1. Is there a ufunc for that? If not, are there suitable C functions that I could apply to the elements from Cython?

Motivation

Below I provide the motivation for the question. If you find that an answer to the question above is not needed to reach the end goal, I am happy to receive comments to that effect. I will then create a separate question to keep things sorted.

The motivation for this question is to implement a version of np.unique(arr, True) that accepts a relative tolerance parameter. Here the second argument of np.unique is important: I need to know the indices of the unique elements (first occurrences!) in the original array. Whether the elements are sorted, on the other hand, does not matter.

I am aware of questions and solutions on np.unique with tolerance. However, I have not found a solution that also returns the indices of the first occurrences of the unique elements in the original array. Furthermore, the solutions I have seen are based on sorting, which runs in O(arr.size log(arr.size)) time. With a hash map, a solution in linear expected time is possible.

The idea is to round each element of arr both up and down and to look both rounded values up in a hash map. If either value is already present, the element is ignored. Otherwise, the element is included in the result and both rounded values are inserted into the hash map. As insertion and lookup run in constant average time for hash maps, this method should in theory be faster than a sorting-based one.
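For clarity, here is the same idea as a minimal pure-Python sketch (a plain set stands in for the hash map; the function name is illustrative):

def unique_tol_sketch(lower, higher):
    # lower/higher: the input rounded down/up to the desired tolerance
    seen = set()
    vals, indices = [], []
    for i in range(len(lower)):
        # keep the element only if neither rounded value was seen before
        if lower[i] not in seen and higher[i] not in seen:
            vals.append(lower[i])
            indices.append(i)
            seen.add(lower[i])
            seen.add(higher[i])
    return vals, indices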

Below is my Cython implementation:

# distutils: language = c++
import numpy as np
cimport numpy as np
import cython
from libcpp.unordered_map cimport unordered_map

# element and index types used below
ctypedef np.float64_t DOUBLE_t
ctypedef np.int64_t INT_t

@cython.boundscheck(False)
@cython.wraparound(False)
def unique_tol(np.ndarray[DOUBLE_t, ndim=1] lower,
               np.ndarray[DOUBLE_t, ndim=1] higher):
    cdef long i, count
    cdef long endIndex = lower.size
    cdef unordered_map[double, short] vals = unordered_map[double, short]()
    cdef np.ndarray[DOUBLE_t, ndim=1] result_vals = np.empty_like(lower)
    cdef np.ndarray[INT_t, ndim=1] result_indices = np.empty_like(lower,
                                                                  dtype=np.int64)

    count = 0
    for i in range(endIndex): 
        if not vals.count(lower[i]) and not vals.count(higher[i]):

            # insert in result
            result_vals[count] = lower[i]
            result_indices[count] = i

            # mark lower[i] and higher[i] as seen in the hash map
            vals[lower[i]] = 0
            vals[higher[i]] = 0

            # update the index in the result
            count += 1

    return result_vals[:count], result_indices[:count]
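For reference, a minimal build script sketch (the file name unique_tol.pyx is an assumption; the # distutils comment in the .pyx selects C++ so that unordered_map is available):

# setup.py
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("unique_tol.pyx", language_level=3),
    include_dirs=[np.get_include()],
)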

Called with appropriately rounded inputs, this method does the job. For example, if differences smaller than 10^-6 shall be ignored, we would write

unique_tol(np.round(a, 6), np.round(a+1e-6, 6))

Now I would like to replace np.round with a relative rounding procedure based on manipulating the mantissa. I am aware of alternative ways of relative rounding, but I think manipulating the mantissa directly would be more efficient and elegant. (Admittedly, I do not expect a significant performance gain. But I would be interested in the solution.)

EDIT

The solution by Warren Weckesser works like a charm. However, the result is not applicable as I had hoped, since two numbers with a very small difference can have different exponents. Unifying the mantissa then does not lead to similar numbers. I guess I have to stick with the relative rounding solutions that are out there.
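To illustrate the problem, here is a small demonstration (my own, with an arbitrarily chosen pair of numbers): two nearly equal values straddling a power of two carry different IEEE-754 exponent fields, so no amount of mantissa masking can make them equal.

import numpy as np

a = np.array([1.0 - 1e-12, 1.0 + 1e-12])
u = a.view(np.uint64)
# the biased exponent fields differ: prints [1022 1023]
print((u >> np.uint64(52)) & np.uint64(0x7FF))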


2 Answers


"I am looking for an efficient way to set the n least significant entries of each element of arr to 0 or to 1."

You can create a view of the array with data type numpy.uint64, and then manipulate the bits in that view as needed.

For example, I'll set the lowest 21 bits in the mantissa of this array to 0.

In [46]: np.set_printoptions(precision=15)                                                            

In [47]: x = np.array([0.0, -1/3, 1/5, -1/7, np.pi, 6.02214076e23])                                   

In [48]: x                                                                                            
Out[48]: 
array([ 0.000000000000000e+00, -3.333333333333333e-01,
        2.000000000000000e-01, -1.428571428571428e-01,
        3.141592653589793e+00,  6.022140760000000e+23])

Create a view of the data in x with data type numpy.uint64:

In [49]: u = x.view(np.uint64)                                                                        

Take a look at the binary representation of the values.

In [50]: [np.binary_repr(t, width=64) for t in u]                                                     
Out[50]: 
['0000000000000000000000000000000000000000000000000000000000000000',
 '1011111111010101010101010101010101010101010101010101010101010101',
 '0011111111001001100110011001100110011001100110011001100110011010',
 '1011111111000010010010010010010010010010010010010010010010010010',
 '0100000000001001001000011111101101010100010001000010110100011000',
 '0100010011011111111000011000010111001010010101111100010100010111']

Set the lower n bits to 0, and take another look.

In [51]: n = 21                                                                                       

In [52]: u &= ~np.uint64(2**n-1)                                                              

In [53]: [np.binary_repr(t, width=64) for t in u]                                                     
Out[53]: 
['0000000000000000000000000000000000000000000000000000000000000000',
 '1011111111010101010101010101010101010101010000000000000000000000',
 '0011111111001001100110011001100110011001100000000000000000000000',
 '1011111111000010010010010010010010010010010000000000000000000000',
 '0100000000001001001000011111101101010100010000000000000000000000',
 '0100010011011111111000011000010111001010010000000000000000000000']

Because u is a view of the same data as in x, x has also been modified in-place.

In [54]: x                                                                      
Out[54]: 
array([ 0.000000000000000e+00, -3.333333332557231e-01,
        1.999999999534339e-01, -1.428571428405121e-01,
        3.141592653468251e+00,  6.022140758954589e+23])
– Warren Weckesser
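The question also asks about setting the low bits to 1; the same view technique should work with bitwise OR instead of AND (a one-line sketch, using the u and n from the session above):

u |= np.uint64(2**n - 1)

If the original array must stay intact, apply the view to a copy (x.copy()) instead of to x itself.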

Similar to @WarrenWeckesser's answer, but without the bit-twiddling black magic, using "official" ufuncs instead. Downside: I'm pretty sure it's slower, quite possibly significantly so:

>>> a = np.random.normal(size=10)**5
>>> a
array([ 9.87664561e-12, -1.79654870e-03,  4.36740261e-01,  7.49256141e+00,
       -8.76894617e-01,  2.93850753e+00, -1.44149959e-02, -1.03026094e-03,
        3.18390143e-03,  3.05521581e-03])
>>> 
>>> mant,expn = np.frexp(a)
>>> mant
array([ 0.67871792, -0.91983293,  0.87348052,  0.93657018, -0.87689462,
        0.73462688, -0.92255974, -0.5274936 ,  0.81507877,  0.78213525])
>>> expn
array([-36,  -9,  -1,   3,   0,   2,  -6,  -9,  -8,  -8], dtype=int32)
>>> a_binned = np.ldexp(np.round(mant,5),expn)
>>> a_binned
array([ 9.87667590e-12, -1.79654297e-03,  4.36740000e-01,  7.49256000e+00,
       -8.76890000e-01,  2.93852000e+00, -1.44150000e-02, -1.03025391e-03,
        3.18390625e-03,  3.05523437e-03])
– Paul Panzer
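For completeness, a sketch of how this could be wrapped as a relative-rounding helper producing the two inputs that unique_tol expects (the function name, the number of digits, and the tolerance scale are illustrative choices, not from the answer):

import numpy as np

def relative_round(a, decimals=5):
    # round the mantissa at a fixed number of digits, keep the exponent
    mant, expn = np.frexp(a)
    return np.ldexp(np.round(mant, decimals), expn)

# e.g., as inputs for unique_tol (sign handling would need extra care):
# unique_tol(relative_round(a), relative_round(a + np.abs(a) * 1e-5))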