I am trying to write a K-means clustering program, which needs Euclidean distances in it. I understand how it works when the data is stored in a list, like in the code below.
# assign each point to its nearest centroid
for featureset in data:
    distances = [np.linalg.norm(featureset - self.centroids[centroid]) for centroid in self.centroids]
    cluster_label = distances.index(min(distances))
But my dataset is very big (around 4 million rows), so using a list or array is definitely not very efficient. I want to store the data in a DataFrame instead. I am thinking of iterating over each row and doing the Euclidean calculation, but that doesn't seem efficient either, even with itertuples() or iterrows(). I am wondering if there is a more efficient way to do this; a sketch of the row-iteration approach I have in mind is below.
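For concreteness, this is roughly the row-by-row version I am trying to avoid. The DataFrame df, the centroids dict, and the column names are made up for illustration:

import numpy as np
import pandas as pd

# Hypothetical example data: one feature per column, and a dict mapping
# each cluster label to a NumPy array of the same dimensionality.
df = pd.DataFrame(np.random.rand(1000, 3), columns=["x", "y", "z"])
centroids = {0: np.array([0.2, 0.2, 0.2]), 1: np.array([0.8, 0.8, 0.8])}

labels = []
for row in df.itertuples(index=False):
    # One Python-level iteration per row; this is the slow part for 4 million rows.
    point = np.array(row)
    distances = [np.linalg.norm(point - centroids[c]) for c in centroids]
    labels.append(distances.index(min(distances)))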