Given a list of indexes (offset values) at which to split a numpy array, I would like to adjust them so that no split falls inside a run of duplicate values. In other words, all occurrences of a given value should end up in one chunk only.
I have worked out the following piece of code, which gives the expected result, but I am not super proud of it. I would like to stay in the numpy world and use vectorized numpy functions as much as possible.
But to check the indexes (offset values), I use a `for` loop and store the result in a list.
Do you have any idea how to vectorize this second part (the loop over the offsets)?
If it helps, `ar` is a sorted array (I am not using this information in the code below).
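To make the goal concrete, here is a small illustration of the problem, separate from my attempt. It is only a sketch: it splits the example array at the initial offsets with `np.split` and lists the values that end one chunk and also start the next. The names `chunks` and `split_values` are just for this illustration.

```python
import numpy as np

ar = np.array([8, 8, 8, 10, 11, 11, 13, 14, 15, 15, 18, 19, 20, 21, 22, 22, 22, 22, 22, 22])
offsets = np.array([0, 2, 4, 9, 11, 13, 15, len(ar)])

# np.split only wants the interior cut points (no leading 0, no trailing len(ar)).
chunks = np.split(ar, offsets[1:-1])

# A value is cut in two when it ends one chunk and also starts the next one.
split_values = [int(a[-1]) for a, b in zip(chunks, chunks[1:]) if a[-1] == b[0]]
print(split_values)
```

With the initial offsets, the values 8, 15 and 22 are each spread over two chunks, which is what I want to avoid.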
import numpy as np
ar = np.array([8,8,8,10,11,11,13,14,15,15,18,19,20,21,22,22,22,22,22,22])
offsets = np.array([0,2,4,9,11,13,15,len(ar)])
_, unique_ind = np.unique(ar, return_index=True, return_inverse=False)
dup_ind = np.diff(unique_ind, append=len(ar))
dup_start = unique_ind[dup_ind > 1]
dup_end = dup_start + dup_ind[dup_ind > 1]
print(f'initial offsets: {offsets}')
#print(f'dup start: {dup_start}')
#print(f'dup end: {dup_end}')
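temp = []
for off in offsets:
    # If the offset falls strictly inside a run of duplicates,
    # move it back to the start of that run.
    for ind in range(len(dup_start)):
        if off > dup_start[ind] and off < dup_end[ind]:
            off = dup_start[ind]
            break
    temp.append(off)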
# Remove duplicates
offsets = list(dict.fromkeys(temp))
print(f'modified offsets: {offsets}')
Results
initial offsets: [ 0 2 4 9 11 13 15 20]
modified offsets: [0, 4, 8, 11, 13, 14, 20]
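One direction that might work, sketched here under the assumption that `ar` is sorted (as stated above): for a sorted array, `np.searchsorted(ar, value, side="left")` returns the index where the run of `value` starts, so each offset can be moved to the start of its run without a Python loop. The names `inside` and `adjusted` are only for this sketch, and I have only checked it against the example data.

```python
import numpy as np

ar = np.array([8, 8, 8, 10, 11, 11, 13, 14, 15, 15, 18, 19, 20, 21, 22, 22, 22, 22, 22, 22])
offsets = np.array([0, 2, 4, 9, 11, 13, 15, len(ar)])

# Offsets equal to len(ar) do not point at an element, so leave them untouched.
inside = offsets < len(ar)

adjusted = offsets.copy()
# Assumption: ar is sorted. The leftmost insertion point of ar[off] is then
# the index where the run of values equal to ar[off] begins.
adjusted[inside] = np.searchsorted(ar, ar[offsets[inside]], side="left")

# Several offsets may collapse onto the same run start; np.unique drops the
# repeats (and returns the offsets in ascending order).
adjusted = np.unique(adjusted)
print(f'modified offsets: {adjusted}')
```

On the example this yields the same offsets as the loop (0, 4, 8, 11, 13, 14, 20). One difference to keep in mind: `np.unique` sorts its output, whereas `dict.fromkeys` preserves the input order, so the two only agree when the offsets are already ascending.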