
Given a list of indexes (offset values) according to which a numpy array is to be split, I would like to adjust it so that no split occurs on duplicate values. This means all duplicates of a value will end up in one chunk only.

I have worked out the following piece of code, which gives the result, but I am not super proud of it. I would like to stay in the numpy world and use vectorized numpy functions as much as possible. But to check the indexes (offset values) I use a for loop, and store the result in a list.

Do you have any idea how to vectorize the second part?

If this can help, ar is a sorted array. (I am not using this information in the code below.)

import numpy as np

ar = np.array([8,8,8,10,11,11,13,14,15,15,18,19,20,21,22,22,22,22,22,22])
offsets = np.array([0,2,4,9,11,13,15,len(ar)])

_, unique_ind = np.unique(ar, return_index=True, return_inverse=False)
dup_ind = np.diff(unique_ind, append=len(ar))
dup_start = unique_ind[dup_ind > 1]
dup_end = dup_start + dup_ind[dup_ind > 1]
print(f'initial offsets: {offsets}')
#print(f'dup start: {dup_start}')
#print(f'dup end: {dup_end}')

temp = []
for off in offsets:
    for ind in range(len(dup_start)):
        if off > dup_start[ind] and off < dup_end[ind]:
            off = dup_start[ind]
            break
    temp.append(off)

# Remove duplicates
offsets = list(dict.fromkeys(temp))
print(f'modified offsets: {offsets}')

Results

initial offsets: [ 0  2  4  9 11 13 15 20]
modified offsets: [0, 4, 8, 11, 13, 14, 20]
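
As a sanity check, the modified offsets can be fed to np.split to confirm that duplicates stay together (interior offsets only, since 0 and len(ar) would just produce empty edge chunks):

```python
import numpy as np

ar = np.array([8,8,8,10,11,11,13,14,15,15,18,19,20,21,22,22,22,22,22,22])
adjusted = [0, 4, 8, 11, 13, 14, 20]  # the modified offsets from above

# np.split takes interior split points; 0 and len(ar) would only
# produce empty edge chunks.
chunks = np.split(ar, adjusted[1:-1])
for chunk in chunks:
    print(chunk)
# Each duplicate value now lives in exactly one chunk:
# [8 8 8 10], [11 11 13 14], [15 15 18], [19 20], [21], [22 22 22 22 22 22]
```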
pierre_j

2 Answers


You can use np.digitize to clamp the offsets into bins:

import numpy as np

ar = np.array([8,8,8,10,11,11,13,14,15,15,18,22,22,22,22,22,22])
offsets = np.array([0,2,4,9,13,15,len(ar)])

_, unique_ind = np.unique(ar, return_index=True, return_inverse=False)
dup_ind = np.diff(unique_ind, append=len(ar))
dup = np.append(unique_ind[dup_ind > 1], len(ar))

offsets = dup[np.digitize(offsets, dup) - 1]

Output:

>>> offsets
array([ 0,  0,  4,  8, 11, 11, 17])

>>> np.unique(offsets)
array([ 0,  4,  8, 11, 17])

It can very likely be compressed into much less code (probably one or two lines), but I figured you really just wanted the loop eliminated, so I thought I'd submit what I found.

  • Thanks! No, this code is not quite right, and my example was not complete. I have completed it to include a 'normal' case where an offset does not fall on a duplicate value. In this case, we simply keep it. I tried with your code, and the value is not kept. – pierre_j Dec 18 '21 at 18:30

Building upon @richardec's answer, here is a possible solution, in case it can help others.

import numpy as np

def adjust_offsets(values, offsets) -> np.ndarray:
    # Creates bins describing when duplicates start and when they end.
    _, indexes, counts = np.unique(values, return_index=True, return_counts=True)
    duplicates = counts>1
    dup_starts = indexes[duplicates]
    dup_ends = dup_starts + counts[duplicates]
    # Interleave starts and ends: runs of duplicates alternate with
    # stretches of unique values.
    bins = np.dstack((dup_starts, dup_ends)).flatten()
    # Offsets inside a duplicate run (odd bin index) snap back to the run's
    # start; offsets between runs (even bin index) are kept unchanged.
    # Default right=False so an offset equal to a run's exclusive end is kept.
    off_binned = np.digitize(offsets, bins)
    return np.unique(np.where(off_binned % 2 == 0, offsets, bins[off_binned - 1]))

# Test setup 1 / starting with no duplicates
# idx:         0,1,2,3,4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21
ar = np.array([4,5,8,8,8,10,11,11,13,14,15,15,18,19,20,21,22,22,22,22,22,22])
offsets = np.array([0,2,4,9,11,14,18,len(ar)-1])
off_adjusted = adjust_offsets(ar, offsets)
ref = np.array([ 0,  2,  9, 10, 14, 16])
assert np.array_equal(off_adjusted, ref)

# Test setup 2 / starting with duplicates
# idx:         0,1,2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19
ar = np.array([8,8,8,10,11,11,13,14,15,15,18,19,20,21,22,22,22,22,22,22])
offsets = np.array([0,2,4,9,11,13,15,len(ar)-1])
off_adjusted = adjust_offsets(ar, offsets)
ref = np.array([ 0,  4,  8, 11, 13, 14])
assert np.array_equal(off_adjusted, ref)

I use 2 tests because a previous version I came up with did not work when the array did not start with duplicates.

In 2 sentences, the algorithm follows this logic:

  • if an offset value (index) falls in a bin defined by [start of duplicates, end of duplicates), it is reset to the start of the duplicates;
  • otherwise, it is left unmodified.
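
To illustrate the bin logic in isolation, here is a tiny self-contained sketch with made-up values and bins (not the test arrays above):

```python
import numpy as np

# For values [5, 5, 7, 9, 9], duplicate runs are [0, 2) and [3, 5),
# so the interleaved bins are [start, end, start, end]:
bins = np.array([0, 2, 3, 5])
offsets = np.array([0, 1, 3, 4])

# Odd bin index -> offset is inside a duplicate run, snap to the run start;
# even bin index -> offset is between runs, keep it.
idx = np.digitize(offsets, bins)
adjusted = np.where(idx % 2 == 0, offsets, bins[idx - 1])
print(adjusted)            # offsets 0 and 1 snap to 0; 3 and 4 snap to 3
print(np.unique(adjusted))
```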

There is possibly a faster approach: as I could see from this related SO answer, np.diff is faster than np.unique, and I think one can obtain the same kind of information with np.diff. I started assessing this possibility, but it grew too complex for me.
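
As a rough sketch of that direction (the function name adjust_offsets_diff is mine, it is not benchmarked, and it relies on values being sorted, as noted in the question), the duplicate runs can be derived from np.diff like this:

```python
import numpy as np

def adjust_offsets_diff(values, offsets):
    values = np.asarray(values)
    # A run starts wherever the value changes; prepend a sentinel so
    # index 0 always counts as a run start.
    starts = np.flatnonzero(np.diff(values, prepend=values[0] - 1))
    ends = np.append(starts[1:], len(values))
    lengths = ends - starts
    dup_starts = starts[lengths > 1]
    dup_ends = ends[lengths > 1]
    # Same interleaved-bins trick as adjust_offsets above.
    bins = np.dstack((dup_starts, dup_ends)).flatten()
    idx = np.digitize(offsets, bins)
    return np.unique(np.where(idx % 2 == 0, offsets, bins[idx - 1]))
```

On the second test setup it reproduces the same result as adjust_offsets.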
