
I am trying to build an algorithm for finding the number of clusters. I need to pick random points from a data set to use as the initial means.

I first tried the following code:

mu=random.sample(df,10) 

It gave an index out of range error.

I then converted the DataFrame to a NumPy array and tried:

mu=random.sample(np.array(df).tolist(),10)

Instead of giving 10 values as means, it gives me 10 arrays of values.

How can I get 10 values from the DataFrame to initialise as the means for 10 clusters?

DYZ

2 Answers


I think you need DataFrame.sample:

mu = df.sample(10) 

Sample:

import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(20,3)), columns=list('abc'))
print(df)
    a  b  c
0   8  8  3
1   7  7  0
2   4  2  5
3   2  2  2
4   1  0  8
5   4  0  9
6   6  2  4
7   1  5  3
8   4  4  3
9   7  1  1
10  7  7  0
11  2  9  9
12  3  2  5
13  8  1  0
14  7  6  2
15  0  8  2
16  5  1  8
17  1  5  4
18  2  8  3
19  5  0  9
mu = df.sample(10)
print(mu)
    a  b  c
11  2  9  9
1   7  7  0
8   4  4  3
5   4  0  9
2   4  2  5
19  5  0  9
13  8  1  0
14  7  6  2
0   8  8  3
9   7  1  1
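
For use as initial cluster means, the sampled rows can be turned into a plain array. The following is only a sketch under assumptions: the data is a made-up stand-in for the real df, and `random_state=0` is an arbitrary choice that makes the draw repeatable. Each mean is a whole row because each point has one value per column, which is why sampling rows gives 10 arrays rather than 10 scalars. If a single column is all that matters, sampling that column gives 10 scalar values instead.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 points with 3 features each (stand-in for the real df).
rng = np.random.default_rng(100)
df = pd.DataFrame(rng.integers(0, 10, size=(20, 3)), columns=list("abc"))

# Draw 10 distinct rows; random_state is an assumed seed for repeatability.
mu = df.sample(n=10, random_state=0)

# One row per initial cluster mean, one column per feature.
initial_means = mu.to_numpy()
print(initial_means.shape)

# If only one feature is clustered, sample that column to get 10 scalars.
means_1d = df["a"].sample(n=10, random_state=0).to_numpy()
print(means_1d.shape)
```

Passing `replace=True` to `sample` would allow the same row to be drawn twice, which is usually not wanted for initial means.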
jezrael

Use numpy.random.choice

df.iloc[np.random.choice(np.arange(len(df)), 10, replace=False)]

Or numpy.random.permutation

df.loc[np.random.permutation(df.index)[:10]]

    a  b  c
11  2  9  9
1   7  7  0
16  5  1  8
15  0  8  2
17  1  5  4
19  5  0  9
10  7  7  0
8   4  4  3
6   6  2  4
14  7  6  2
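
The same idea can be written with NumPy's newer Generator interface, which keeps the random state local instead of relying on the global seed. This is a sketch, not part of the original answer: the DataFrame contents and the seed 42 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 rows, 3 columns (stand-in for the real df).
df = pd.DataFrame(np.arange(60).reshape(20, 3), columns=list("abc"))

# A local generator; the seed value is arbitrary.
rng = np.random.default_rng(42)

# Pick 10 distinct row positions, then select those rows by position.
positions = rng.choice(len(df), size=10, replace=False)
mu = df.iloc[positions]
print(mu)
```

Because `replace=False` is set, no row position can appear twice, so the 10 initial means are always distinct rows.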
piRSquared