Question
Suppose we are given a numpy array arr of doubles and a small positive integer n. I am looking for an efficient way to set the n least significant bits of the mantissa of each element of arr to 0 or to 1. Is there a ufunc for that? If not, are there suitable C functions that I could apply to the elements from Cython?
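For concreteness, here is a minimal sketch of the operation I have in mind, done in pure NumPy via a uint64 view of the float64 bit patterns (the function name is mine, and I am not claiming that a dedicated ufunc for this exists):

import numpy as np

def clear_mantissa_bits(arr, n):
    # Reinterpret the float64 bit patterns as uint64 and zero the
    # n least significant bits in place; for small n these are
    # exactly the low mantissa bits.
    bits = arr.view(np.uint64)
    bits &= ~np.uint64((1 << n) - 1)
    return arr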
Motivation
Below I provide the motivation for the question. If you find that answering the question above is not needed to reach the end goal, I am happy to hear so in the comments; I will then create a separate question in order to keep things sorted.
The motivation for this question is to implement a version of np.unique(arr, True) that accepts a relative tolerance parameter. Here the second argument of np.unique matters: I need to know the indices of the unique elements (first occurrence!) in the original array. Whether the elements come out sorted is not important.
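For concreteness, this is the behaviour of the second argument that I rely on:

import numpy as np

a = np.array([3.0, 1.0, 3.0, 2.0, 1.0])
vals, idx = np.unique(a, return_index=True)
# vals is [1., 2., 3.]; idx is [1, 3, 0],
# the indices of the first occurrences in a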
I am aware of questions and solutions on np.unique with tolerance. However, I have not found a solution that also returns the indices of the first occurrences of the unique elements in the original array. Furthermore, the solutions I have seen were based on sorting, which runs in O(arr.size log(arr.size)). With a hash map, however, a solution in O(arr.size) expected time is possible, since each insertion and lookup takes constant average time.
The idea is to round each element of arr both up and down and to use these rounded values as keys in a hash map. If either value is already in the hash map, the element is ignored; otherwise it is included in the result. Since insertion and lookup run in constant average time for hash maps, this method should in theory be faster than a sorting-based one.
Below is my Cython implementation:
# distutils: language = c++

import numpy as np
cimport numpy as np
import cython
from libcpp.unordered_map cimport unordered_map

ctypedef np.float64_t DOUBLE_t
ctypedef np.int64_t INT_t

@cython.boundscheck(False)
@cython.wraparound(False)
def unique_tol(np.ndarray[DOUBLE_t, ndim=1] lower,
               np.ndarray[DOUBLE_t, ndim=1] higher):
    cdef long i, count
    cdef long endIndex = lower.size
    cdef unordered_map[double, short] vals = unordered_map[double, short]()
    cdef np.ndarray[DOUBLE_t, ndim=1] result_vals = np.empty_like(lower)
    cdef np.ndarray[INT_t, ndim=1] result_indices = np.empty_like(lower,
                                                                  dtype=np.int64)
    count = 0
    for i in range(endIndex):
        if not vals.count(lower[i]) and not vals.count(higher[i]):
            # insert in result
            result_vals[count] = lower[i]
            result_indices[count] = i
            # operator[] default-inserts lower[i] and higher[i] as keys
            vals[lower[i]]
            vals[higher[i]]
            # update the index in the result
            count += 1
    return result_vals[:count], result_indices[:count]
Called with appropriately rounded inputs, this method does the job. For example, if differences smaller than 10^-6 should be ignored, we would write
unique_tol(np.round(a, 6), np.round(a+1e-6, 6))
Now I would like to replace np.round with a relative rounding procedure based on manipulation of the mantissa. I am aware of alternative ways of relative rounding, but I think manipulating the mantissa directly should be more efficient and elegant. (Admittedly, I do not think the performance gain would be significant, but I would be interested in the solution.)
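For comparison, one of the alternative relative-rounding approaches I have in mind could look roughly like this sketch, which splits each value into mantissa and exponent with np.frexp and rounds the mantissa to n binary digits (the function name and the parameter n are mine):

import numpy as np

def round_relative(arr, n):
    # Split into mantissa in [0.5, 1) and integer exponent,
    # round the mantissa to n binary digits, then reassemble.
    m, e = np.frexp(arr)
    return np.ldexp(np.round(m * 2**n) / 2**n, e)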
EDIT
The solution by Warren Weckesser works like a charm. However, the result is not usable in the way I was hoping, since two numbers with a very small difference can have different exponents, and unifying the mantissa then does not make them similar. I guess I have to stick with the relative rounding solutions that are out there.
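To illustrate the failure mode with a small example (values chosen by me):

import numpy as np

a, b = 0.9999999999, 1.0000000001   # nearly equal, but straddling 1.0
np.frexp(a)   # (0.9999999999, 0):   mantissa near 1,   exponent 0
np.frexp(b)   # (0.50000000005, 1):  mantissa near 0.5, exponent 1
# Truncating mantissa bits cannot bring these two values together.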