Clustering algorithm in Python

Question

I'm trying to code this algorithm but I'm struggling with it, and step 6 is confusing me. My code so far is at the bottom.

1. Set a positive value for K.
2. Select K different rows from the data matrix at random.
3. For each of the selected rows:
   a. Copy its values to a new list; let us call it c. Each element of c is a number. (At the end of step 3, you should have the lists c_1, c_2, ..., c_K. Each of these should have the same number of columns as the data matrix.)
4. For each row i in the data matrix:
   a. Calculate the Manhattan distance between data row i and each of the lists c_1, c_2, ..., c_K.
   b. Assign row i to the cluster of the nearest c. For instance, if the nearest c is c_3, then assign row i to cluster 3 (i.e. you should have a list whose ith entry is equal to 3; let's call this list S).
5. If the previous step does not change S, stop.
6. For each k = 1, 2, ..., K:
   a. Update c_k. Each element j of c_k should be equal to the median of column j, but only taking into consideration those rows that have been assigned to cluster k.
7. Go to Step 4.

Note that in the above, K is not the same thing as k.

This is what I have so far:
def clustering(matrix,k):
    for i in k:

I'm stuck on how to choose the rows randomly, and I also don't understand what steps 5 and 6 mean. Could someone explain?

Steps 5 and 6 are off-topic IMO. Have you done any research? All you need is to select some data randomly, no? — AMC, Dec 18 '19 at 18:17
Also, this is probably a duplicate: https://stackoverflow.com/q/14262654/11301900 — AMC, Dec 18 '19 at 18:18

seralouk · Answer 1 · 2019-12-18T19:15:55.613

You need np.random.choice.

Use this:

import numpy as np

# some data with 10 rows and 5 columns
X=np.random.rand(10,5)

def clustering(X, k):
    # create k random row indices (the selector)
    random_selector = np.random.choice(range(X.shape[0]), size=k, replace=False)  # replace=False gives unique rows

    # randomly select k rows of X
    sampled_X = X[random_selector]  # sampled_X.shape == (k, 5) for the X above

    # ... (the remaining steps of the algorithm go here)
    return  # SOMETHING

Now you can continue working on your ?homework?
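As for what steps 4 to 6 mean, here is a minimal sketch under some assumptions, not a definitive implementation. Step 5 says to stop once the assignment list S is the same as in the previous pass, and step 6 replaces each c_k by the column-wise median of the rows currently assigned to cluster k. The function name `k_medians` and the `max_iter` and `seed` parameters are my own additions for illustration; labels are 0-based here (0..K-1) rather than 1..K, and an empty cluster simply keeps its old center, which the steps in the question do not specify.

```python
import numpy as np

def k_medians(X, k, max_iter=100, seed=None):
    """Sketch of steps 2-7; max_iter and seed are illustrative additions."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Steps 2-3: pick k distinct rows and copy them as the initial centers c_1..c_K.
    centers = X[rng.choice(X.shape[0], size=k, replace=False)].copy()
    labels = None  # this plays the role of the list S
    for _ in range(max_iter):
        # Step 4a: Manhattan distance from every row to every center, shape (n_rows, k).
        dists = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        # Step 4b: assign each row to its nearest center.
        new_labels = dists.argmin(axis=1)
        # Step 5: stop when the assignments did not change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 6: each center becomes the column-wise median of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old center
                centers[j] = np.median(members, axis=0)
        # Step 7: loop back to step 4.
    return labels, centers
```

Calling `labels, centers = k_medians(X, 3)` on the `X` above would give one cluster index per row plus the final centers.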

edited Dec 18 '19 at 19:15

answered Dec 18 '19 at 15:35


Why not use `random.randint()`? (https://stackoverflow.com/q/14262654/11301900) – AMC Dec 18 '19 at 18:19
it's the same thing -- I prefer `choice` – seralouk Dec 18 '19 at 19:11
You do realize that this way is just more complicated for no reason, right? – AMC Dec 18 '19 at 19:14
in this particular case yes. it leads to the same desired result. – seralouk Dec 18 '19 at 19:14
Sure, but just because two methods give the same result doesn’t mean that they are equally as good. I see no benefit to using `choice`, it’s reinventing the wheel. – AMC Dec 18 '19 at 19:15
funny thing that the link you posted uses also `choice` as a second solution. `AND` the second most upvoted answer. Cheers – seralouk Dec 18 '19 at 19:23
Did you read the first answer carefully? It’s using it to draw numbers without replacement. If that’s what OP wants, then **yes**, they should use `choice` (although I would probably recommend `numpy.arange` over a generic `range()`). For the more general case, however, `randint` is the way to go. – AMC Dec 18 '19 at 19:58
Hmm actually looking at one of the answers in the other post, and the docs, it seems that you can use just `.shape[0]`, and `choice` will behave as if you had used `arange`. Useful stuff! – AMC Dec 18 '19 at 20:02
great. cheers. ! – seralouk Dec 19 '19 at 18:24
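To make the point from the comment thread concrete, here is a small sketch of the difference between the two calls. The variable names are illustrative. Passing an integer to `np.random.choice` samples from `np.arange` of that integer, and `replace=False` guarantees distinct indices, which is what step 2 of the algorithm needs. `np.random.randint` draws each index independently, so the same row can come up more than once.

```python
import numpy as np

n_rows, k = 10, 4

# choice with an integer first argument samples from np.arange(n_rows);
# replace=False guarantees k distinct row indices.
distinct = np.random.choice(n_rows, size=k, replace=False)

# randint draws each index independently from [0, n_rows),
# so repeated indices are possible.
maybe_repeated = np.random.randint(0, n_rows, size=k)

print(distinct, maybe_repeated)
```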
