I am applying K-means clustering on a pandas dataframe. The cluster assignment function is as follows:

def assign_to_cluster(row):
    lowest_distance = -1
    closest_cluster = -1

    for cluster_id, centroid in centroids_dict.items():
        df_row = [row['PPG'],row['ATR']]
        euclidean_distance = calculate_distance(centroids, df_row)

        if lowest_distance == -1:
            lowest_distance = euclidean_distance
            closest_cluster = cluster_id
        elif euclidean_distance < lowest_distance:
            lowest_distance = euclidean_distance
            closest_cluster = cluster_id
    return closest_cluster

point_guards['CLUSTER'] = point_guards.apply(lambda row: assign_to_cluster(row), axis=1)

But I get the following error while using the lambda function:

   1945                 return self._engine.get_loc(key)
   1946             except KeyError:
-> 1947                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   1948 
   1949         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)()

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)()

pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)()

pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)()

KeyError: (0, 'occurred at index 0')

Can someone please explain the reason for the error and how I can solve it? If you need additional information, please reply to this post. Apologies for the formatting; this is my first time asking a question on Stack Overflow.

  • what is point_guards.head()? – parsethis Feb 21 '17 at 15:21
  • see: http://stackoverflow.com/questions/16353729/pandas-how-to-use-apply-function-to-multiple-columns – parsethis Feb 21 '17 at 15:24
  • @putonspectacles: point_guards is the name of the pandas dataframe I am working on. The head() function returns the first rows of the dataframe (five by default). At least, that's what I think it does. – Aditya Gogoi Feb 26 '17 at 16:51
  • @putonspectacles: Thanks for the thread. However, I found out that I had made a small mistake in my code. I have written the solution below. :) – Aditya Gogoi Feb 26 '17 at 16:53

1 Answer

It turns out that I had made a simple mistake: I passed the wrong variable, which is a naming error rather than a syntax error. Instead of passing 'centroid', the loop variable from 'centroids_dict.items()', to 'calculate_distance':

for cluster_id, centroid in centroids_dict.items():
    df_row = [row['PPG'],row['ATR']]
    euclidean_distance = calculate_distance(centroid, df_row)
....

I used 'centroids' instead:

for cluster_id, centroid in centroids_dict.items():
    df_row = [row['PPG'],row['ATR']]
    euclidean_distance = calculate_distance(centroids, df_row)

Passing 'centroid' instead fixed the error.
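
For anyone who wants to try the corrected loop in isolation, here is a minimal sketch. The post does not show 'calculate_distance', so the version below is a hypothetical stand-in that computes a plain Euclidean distance. The sample centroids and rows are made up for illustration. 'centroids_dict' is passed in as an argument instead of being read as a global, which is a choice made here and not part of the original code. The rows are plain dicts so the sketch runs without pandas; a pandas row supports the same row['PPG'] lookup.

```python
import math


def calculate_distance(centroid, player_values):
    # Hypothetical helper: the original post does not show this function.
    # Assumes both arguments are equal-length sequences of numbers.
    return math.sqrt(sum((c - p) ** 2 for c, p in zip(centroid, player_values)))


def assign_to_cluster(row, centroids_dict):
    # row can be any mapping with 'PPG' and 'ATR' keys (a dict or a pandas row).
    df_row = [row['PPG'], row['ATR']]
    lowest_distance = float('inf')
    closest_cluster = None

    for cluster_id, centroid in centroids_dict.items():
        # Pass the loop variable 'centroid', not the whole collection of centroids.
        euclidean_distance = calculate_distance(centroid, df_row)
        if euclidean_distance < lowest_distance:
            lowest_distance = euclidean_distance
            closest_cluster = cluster_id
    return closest_cluster


# Made-up sample data for illustration only.
centroids_dict = {0: [10.0, 2.0], 1: [20.0, 3.0]}
rows = [{'PPG': 11.0, 'ATR': 2.1}, {'PPG': 19.0, 'ATR': 2.9}]
clusters = [assign_to_cluster(row, centroids_dict) for row in rows]
print(clusters)
```

With the dataframe from the question, the equivalent call would be point_guards.apply(assign_to_cluster, axis=1, args=(centroids_dict,)). Starting 'lowest_distance' at infinity also removes the need for the -1 sentinel and the separate first-iteration branch.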