
I am trying to write a K-means clustering program, which needs Euclidean distances in it. I understand how it works when the data is stored in a list, like the code below.

for featureset in data:
    distances = [np.linalg.norm(featureset - self.centroids[centroid]) for centroid in self.centroids]
    cluster_label = distances.index(min(distances))

But my dataset is very big (around 4 million rows), so using a list or array is definitely not very efficient. I want to store the data in a dataframe instead. I am thinking of iterating over each row of the data and doing the Euclidean calculation, but that doesn't seem very efficient, even if I use itertuples() or iterrows(). I am wondering if there is any more efficient way to do this.

Chanda Korat
catris25
  • so each data frame has a column that contains tuples of points? – gold_cy Apr 19 '17 at 10:35
  • If I truly get what you mean, yes. Each row contains a set of data points. @DmitryPolonskiy – catris25 Apr 19 '17 at 11:03
  • @AnnaRG, please provide small (3-7 rows) reproducible sample data sets (in text/CSV format) and desired data set. [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – MaxU - stand with Ukraine Apr 19 '17 at 13:12

1 Answer


When you calculate the distance in your list comprehension, centroid is already an element of self.centroids, so there is no need to subscript it again in your norm calculation. The code you provided should probably be changed to something like this:

distances = [np.linalg.norm(featureset - centroid) for centroid in self.centroids]

However, if you use an np.array for data storage, this might be more efficient:

cluster_label = np.linalg.norm(self.centroids - featureset, axis=1).argmin()
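To see why this one-liner works, here is a toy run (the centroid and feature values are made up, and self.centroids is replaced by a plain array): subtracting a single featureset from the centroid array broadcasts row-wise, and axis=1 produces one distance per centroid.

```python
import numpy as np

# Made-up centroids (2 clusters in 2-D) and one featureset.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
featureset = np.array([1.0, 1.0])

# Broadcasting: (2, 2) - (2,) -> per-centroid difference vectors,
# axis=1 -> one Euclidean norm per centroid.
distances = np.linalg.norm(centroids - featureset, axis=1)
cluster_label = distances.argmin()
print(cluster_label)  # 0 -- (1, 1) is closest to the centroid at the origin
```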

Let's define a function which will return the centroid label for some featureset:

def get_label(featureset):
    return np.linalg.norm(self.centroids - featureset, axis=1).argmin()

Now we can apply this function along the entire dataset:

labels = np.apply_along_axis(get_label, 1, data)
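Note that np.apply_along_axis still calls the Python function once per row. As a sketch (with made-up data, and a module-level centroids array standing in for self.centroids), the same labels can be computed in one fully broadcast expression, which avoids the per-row Python overhead:

```python
import numpy as np

# Made-up centroids and data for illustration.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
data = np.array([[0.5, 0.5],
                 [9.0, 9.5],
                 [0.1, -0.2]])

# (n, 1, d) - (1, k, d) broadcasts to (n, k, d) difference vectors;
# the norm over axis=2 gives an (n, k) distance matrix.
dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)
print(labels)  # [0 1 0]
```

The trade-off is memory: the intermediate (n, k, d) array can be large when n is in the millions, which is where the chunking idea below the answer's code comes in.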

In case the data is too large to process as a single np.array, you might split it into smaller pieces, process them separately, and then concatenate the results.
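A minimal sketch of that chunked approach (data and centroids are made up; the chunk count would be tuned to your memory budget in practice):

```python
import numpy as np

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
data = np.array([[0.5, 0.5], [9.0, 9.5], [0.1, -0.2], [10.2, 9.8]])

def label_chunk(chunk):
    # Distance of every row in the chunk to every centroid, then argmin.
    d = np.linalg.norm(chunk[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

# Split the rows into pieces, label each piece, concatenate the results.
chunks = np.array_split(data, 2)  # 2 pieces here; use more for real data
labels = np.concatenate([label_chunk(c) for c in chunks])
print(labels)  # [0 1 0 1]
```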

Luchko