Map array values to unique sequential values sorted by occurrences

Question

In my ML app, I use a 1D np.array output Y to color-code the dots of a scatter plot. Its values are widely spread integers, and I need to map them to sequential integers to make better use of the colormap's range.

What I did is this:

def normalize(Y):
    U = np.unique(Y)
    for i in range(U.size):
        Y[Y==U[i]] = i
    return Y

This replaces each value with its index in the array of unique values.

I wonder if there is a more efficient way to do this with numpy. There has got to be a powerful one-liner somewhere out there.

Another thing I could not figure out is how to order the sequential values by the number of occurrences of each value in Y, so that the distribution of the clusters is obvious on the plot.

Does this answer your question? [Getting the indices of several elements in a NumPy array at once](https://stackoverflow.com/questions/32191029/getting-the-indices-of-several-elements-in-a-numpy-array-at-once) — aydow, Feb 12 '23 at 11:40

score 0 · Answer 1 · answered Feb 12 '23 at 11:47

I would use scikit-learn's LabelEncoder:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
Ynormed = le.fit_transform(Y)

You can check their source code to see how it is implemented. Under the hood it also uses np.unique.

Basically you can do it as a one-liner with u, indices = np.unique(a, return_inverse=True), where indices is a sequential coding of your array. (This is basically what LabelEncoder does.)

For your second task you can also pass return_counts=True and use the counts to order the codes.
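To make that last hint concrete, here is a minimal sketch of one way to combine return_inverse and return_counts. The function name codes_by_frequency and the choice to give code 0 to the most frequent value are my own assumptions for illustration, not something LabelEncoder does.

```python
import numpy as np

def codes_by_frequency(Y):
    """Map each value in Y to 0..k-1, with 0 for the most frequent value."""
    # inverse[i] is the position of Y[i] in the sorted unique values
    _, inverse, counts = np.unique(Y, return_inverse=True, return_counts=True)
    # Unique-value positions ordered from most to least frequent;
    # the stable sort breaks ties in favour of the smaller value
    order = np.argsort(-counts, kind="stable")
    # rank[j] is the new code for the j-th unique value
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)
    # Look up the code of every element, keeping the order of Y
    return rank[inverse]

Y = np.array([3, 3, 2, 2, 2, 5, 5, 9, 9, 9, 9, 4])
codes = codes_by_frequency(Y)
```

Sorting on counts instead of -counts would give the rarest value code 0, if that suits the colormap better.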

I like your idea to use `np.unique(Y, return_inverse=True)[1]` best. — John Zwinck, Feb 12 '23 at 11:49

score 0 · Answer 2 · answered Feb 12 '23 at 11:47

With Pandas:

import pandas as pd

pd.Series(Y).rank(method="dense") - 1 # rank is 1-based; the result is a float Series

Ref: https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html

Or SciPy has the same: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rankdata.html
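A minimal sketch of the SciPy variant, assuming SciPy is installed. The sample array is made up for illustration, and the cast to int is only there so the codes can be used as integer indices regardless of the dtype rankdata returns.

```python
import numpy as np
from scipy.stats import rankdata

Y = np.array([10, 10, 100, 13, 25, 25, 10, 13, 2, 3])

# Dense ranks are 1-based and leave no gaps between tied groups,
# so subtracting 1 gives sequential codes starting at 0
codes = (rankdata(Y, method="dense") - 1).astype(int)
```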

answered Feb 12 '23 at 11:47

John Zwinck

lezaf · Answer 3 · 2023-02-12T12:10:56.240

Using only numpy operations, I think that this solves your task:

def normalize(Y):
    # Unique elements, each element's index into them, and the counts
    un, inv, cnts = np.unique(Y, return_inverse=True, return_counts=True)
    sort_inds = np.argsort(cnts, kind="stable") # Unique-value indices ordered by ascending count
    un_mapped = np.empty_like(sort_inds)
    un_mapped[sort_inds] = np.arange(un.shape[0]) # Code of each unique value = its rank by count

    # Look up the code of every element of Y, keeping the original order
    return un_mapped[inv]

Sample output:

Y = np.array([3,3,2,2,2,5,5,9,9,9,9,4]) # Input
print(normalize(Y))
# Output (codes ordered by count, rarest value first): [1 1 3 3 3 2 2 4 4 4 4 0]

Warren Weckesser · Answer 4 · 2023-02-12T12:10:11.240

Your function normalize can be replaced by using the return_inverse option of np.unique.

For example,

In [9]: Y = np.array([10, 10, 100, 13, 25, 25, 10, 13, 2, 3])

In [10]: normalize(Y.copy())  # Use a copy, because `normalize` modifies Y in place.
Out[10]: array([2, 2, 5, 3, 4, 4, 2, 3, 0, 1])

In [11]: np.unique(Y, return_inverse=True)[1]
Out[11]: array([2, 2, 5, 3, 4, 4, 2, 3, 0, 1])
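One more difference worth checking, shown here as a small sketch with a made-up array: the loop overwrites Y while it is still comparing against the old values, so as far as I can tell it can merge labels when the data contains negative numbers. np.unique does not have this problem and leaves Y untouched. The name normalize_inplace is just my label for the loop from the question.

```python
import numpy as np

def normalize_inplace(Y):
    # The loop from the question: overwrites Y while iterating
    U = np.unique(Y)
    for i in range(U.size):
        Y[Y == U[i]] = i
    return Y

Y = np.array([-5, 0, 7, 0])

# return_inverse builds a new array and never writes to Y
codes = np.unique(Y, return_inverse=True)[1]

# In the loop, -5 is first rewritten to 0 and then matched again
# as if it were an original 0, so two distinct values share a code
looped = normalize_inplace(Y.copy())
```

With non-negative integers the two approaches agree, as in the session above.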

edited Feb 12 '23 at 12:10

answered Feb 12 '23 at 12:02

Warren Weckesser
