
In my ML app, I use a 1D output np.array Y to color-code the dots of a scatterplot. I need to map a set of widely spread integer values to sequential integers so that the colors of the colormap are used more evenly.

What I did is this:

def normalize(Y):
    U = np.unique(Y)
    for i in range(U.size):
        Y[Y==U[i]] = i
    return Y

This replaces each value with its index in the array's sorted unique values.

I wonder if there is a more efficient way to do this with NumPy. There's got to be a powerful one-liner somewhere out there.

*Another thing I could not figure out is how to have the sequential values ordered according to the number of corresponding occurrences in Y, so that the distribution of the clustering is obvious on the plot.

Michael Hall
Lex Podgorny
  • Does this answer your question? [Getting the indices of several elements in a NumPy array at once](https://stackoverflow.com/questions/32191029/getting-the-indices-of-several-elements-in-a-numpy-array-at-once) – aydow Feb 12 '23 at 11:40

4 Answers


I would use scikit-learn's LabelEncoder:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
Ynormed = le.fit_transform(Y)

You can check how they implemented it in their source code. Under the hood they also use np.unique.

Basically, you can do it as a one-liner with u, indices = np.unique(a, return_inverse=True), where indices is a sequential coding of your array. (This is basically what LabelEncoder does.)
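As a small sketch of that one-liner (the sample array is borrowed from another answer on this page, not from the asker's data), both return values are useful: indices holds the sequential codes, and indexing u with those codes recovers the original values, which can help when labelling a colorbar.

```python
import numpy as np

Y = np.array([10, 10, 100, 13, 25, 25, 10, 13, 2, 3])

# u: sorted distinct values; indices: position of each Y element within u.
u, indices = np.unique(Y, return_inverse=True)
print(indices)  # [2 2 5 3 4 4 2 3 0 1]

# Indexing the unique values with the codes gives back the original array.
restored = u[indices]
```

Unlike the loop in the question, this leaves Y itself unmodified.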


For your second task, you can use return_counts=True and sort by the counts.
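One way this could look is sketched below. The function name normalize_by_count is hypothetical, and it assumes the most frequent value should get label 0 (the question does not say which direction is wanted). Labels stay aligned with the positions in Y, so they can be passed straight to a scatterplot.

```python
import numpy as np

def normalize_by_count(Y):
    # uniques: sorted distinct values; inverse: index of each Y element in
    # uniques; counts: number of occurrences of each distinct value.
    uniques, inverse, counts = np.unique(
        Y, return_inverse=True, return_counts=True
    )
    # order[k] is the index into uniques of the k-th most frequent value.
    # A stable sort on the negated counts breaks ties by ascending value.
    order = np.argsort(-counts, kind="stable")
    # rank[j] is the new label of uniques[j]; 0 means most frequent
    # (an assumed convention, flip the sign above for the opposite order).
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)
    return rank[inverse]

Y = np.array([3, 3, 2, 2, 2, 5, 5, 9, 9, 9, 9, 4])
labels = normalize_by_count(Y)
print(labels)  # [2 2 1 1 1 3 3 0 0 0 0 4]
```

Here 9 occurs four times and gets label 0, while 4 occurs once and gets label 4; the tied values 3 and 5 are ordered by value.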

Daraan

With Pandas:

import pandas as pd
pd.Series(Y).rank(method="dense") - 1  # rank is 1-based and returns floats

Ref: https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html

Or SciPy has the same: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rankdata.html
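A minimal sketch of the SciPy variant, assuming SciPy is installed; the sample array is only for illustration. The explicit cast is there so the labels are plain integers regardless of the dtype rankdata returns.

```python
import numpy as np
from scipy.stats import rankdata

Y = np.array([10, 10, 100, 13, 25, 25, 10, 13, 2, 3])

# method="dense" gives tied values the same rank with no gaps.
# Ranks start at 1, so subtract 1 to get 0-based labels.
labels = rankdata(Y, method="dense").astype(int) - 1
print(labels)  # [2 2 5 3 4 4 2 3 0 1]
```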

John Zwinck

Using only NumPy operations, I think this solves your task. Note that the result comes out grouped by count rather than aligned with the positions in Y:

import numpy as np

def normalize(Y):
    un, cnts = np.unique(Y, return_counts=True) # Unique elements and corresponding counts
    un_mapped = np.arange(un.shape[0]) # Make the mapped sequence
    sort_inds = np.argsort(cnts) # Indices that sort the counts in ascending order

    # Repeat each mapped number by its count, least frequent first
    return np.repeat(un_mapped[sort_inds], cnts[sort_inds])

Sample output:

Y = np.array([3,3,2,2,2,5,5,9,9,9,9,4]) # Input
print(normalize(Y))
# Output (mapped and sorted): [2 1 1 3 3 0 0 0 4 4 4 4]
lezaf

Your function normalize can be replaced by using the return_inverse option of np.unique.

For example,

In [9]: Y = np.array([10, 10, 100, 13, 25, 25, 10, 13, 2, 3])

In [10]: normalize(Y.copy())  # Use a copy, because `normalize` modifies Y in place.
Out[10]: array([2, 2, 5, 3, 4, 4, 2, 3, 0, 1])

In [11]: np.unique(Y, return_inverse=True)[1]
Out[11]: array([2, 2, 5, 3, 4, 4, 2, 3, 0, 1])
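
The two approaches agree on the example above, but they are not equivalent for every input. The sketch below (normalize_loop is a hypothetical copy-based version of the question's loop) shows that when Y contains negative values, an index written by the loop can equal a later unique value and get overwritten, whereas return_inverse is unaffected.

```python
import numpy as np

def normalize_loop(Y):
    # The question's loop, applied to a copy so the caller's array is untouched.
    Y = Y.copy()
    U = np.unique(Y)
    for i in range(U.size):
        Y[Y == U[i]] = i
    return Y

Y = np.array([-5, 0, 3])
loop_labels = normalize_loop(Y)
inverse_labels = np.unique(Y, return_inverse=True)[1]

print(loop_labels)     # [1 1 2]  (-5 became 0, then was relabelled along with the original 0)
print(inverse_labels)  # [0 1 2]
```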
Warren Weckesser