Scatter plots in python to represent the points closer to centroids for K-mean clustering

Question

I am writing a simple K-means algorithm for clustering, and I am trying to render a scatter plot of the sample data (rows loaded from a CSV file into a NumPy matrix X).

Let us say X is a NumPy matrix in which each row holds one example with 10 features. In my case they are attributes of a network flow, such as source IP address, destination IP address, source port and destination port. I have also computed the centroids for K-means (where K is the total number of centroids). I have a list idx that holds, for each row of X, the index of the centroid that row belongs to. For example, if row 5 of X belongs to centroid 3, then idx[4] = 3 (since indexing starts from 0). In this way each row of X, an individual data record of 10 features, belongs to exactly one centroid.

I want to draw a scatter plot of the data points in X, with a separate color for each centroid. For example, if rows 5 and 8 of X are closest to centroid 3, I want both drawn in the same color, different from the colors of the other centroids. If I were to do it in Octave, I could have written the code like this:

function plotPoints(X,idx,K)
  p= hsv(K+1) % palette
  c= p(idx,:) % color
  scatter(X(:,1),X(:,2),15,c) % plot the scatter plot

However, in Python I am not sure how to implement the same thing, so that data samples with the same index assignment get the same color. My code currently shows all the X rows in red and all the centroids in blue, as shown below:

def plotPoints(X,idx,K,centroids):
    srcport=X[:,5]
    dstport=X[:,6]

    fig = plt.figure()
    ax=fig.add_subplot(111,projection='3d')
    ax.scatter(srcport,dstport,c='r',marker='x')
    ax.scatter(centroids[:,5],centroids[:,6],c='b',marker='o', s=160)
    ax.set_xlabel('Source port')
    ax.set_ylabel('Destination port')
    plt.show()

Please note: I am only plotting 2 features on x & y axis and not all of the 10 features. I should have mentioned that earlier.

[This post](http://stackoverflow.com/questions/26139423/plot-different-color-for-different-categorical-levels-using-matplotlib) discusses a variety of options that may be useful for you. — andrew_reece, May 06 '17 at 02:37
In that case, see my answer below - it should achieve what you're going for, minus the 3D. — andrew_reece, May 07 '17 at 00:30

andrew_reece · Accepted Answer · 2017-05-06T02:47:00.743

Seaborn and Pandas work well together for this kind of plotting.
If they're available to you, consider the following solution:

# generate sample data
import numpy as np
values = np.random.random(500).reshape(50,10) * 10
centroid = np.random.choice(np.arange(5), size=50).reshape(-1,1)
data = np.concatenate((values, centroid), axis=1)

# convert to DataFrame
import pandas as pd
colnames = ['a','b','c','d','e','f','g','h','i','j','centroid']
df = pd.DataFrame(data, columns=colnames)

# data frame looks like:
df.head()

   a  b  c  d  e  f  g  h  i  j  centroid
0  6  9  9  9  1  2  4  0  8  9         4
1  9  1  0  0  7  9  9  3  7  2         1
2 10  4  8  7  2  8  9  4  6  8         3
3  2  6  5  2  8  4  9  3  9  5         4
4  9  7  5  1  3  2  1  8  3  4         4

# plot with Seaborn
import seaborn as sns
sns.lmplot(x='a', y='b', hue='centroid', data=df, scatter=True, fit_reg=False)

Here's a pure Numpy/Pyplot version, if you're restricted to those modules:

from matplotlib import pyplot as plt
fig, ax = plt.subplots()

colors = {0:'purple', 1:'red', 2:'blue', 3:'green', 4:'black'}

ax.scatter(x=data[:,0], y=data[:,1], c=[colors[x] for x in data[:,10]])

Thanks Andrew_reece for your response. The challenge I have with this solution is that I don't know in advance how many centroids I will start with. I run the cost function to find the lowest cost and take the corresponding value as my K. A static dictionary of colors will therefore not scale. If you look at my Octave code, I pick the colors from a palette defined by K. — sunny, May 08 '17 at 03:33
That's only an issue with Matplotlib - the Pandas/Seaborn solution will just scale the number of colors automatically to the number of centroids you have in your `centroid` vector. You can use a `cmap` instead of a static color mapping, if you want to use the Matplotlib solution. — andrew_reece, May 08 '17 at 03:38
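A minimal sketch of the `cmap` idea from the comment above, mirroring the Octave `p(idx,:)` palette lookup from the question. The helper name `palette_colors`, the choice of the `hsv` colormap, and the random sample data are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to open a window
from matplotlib import pyplot as plt


def palette_colors(idx, K, cmap_name="hsv"):
    """Return one RGBA color per sample, picked from a K-color palette.

    Hypothetical helper, not part of the original answer.
    """
    cmap = plt.get_cmap(cmap_name)
    # K evenly spaced colors; endpoint=False avoids sampling both ends of
    # hsv, which wraps around to red at 0 and at 1.
    palette = cmap(np.linspace(0.0, 1.0, K, endpoint=False))
    # Integer-array indexing: row i of the result is the color of cluster idx[i].
    return palette[np.asarray(idx, dtype=int)]


# Made-up data standing in for two feature columns and their cluster labels.
rng = np.random.default_rng(0)
K = 4
points = rng.random((50, 2)) * 10
idx = rng.integers(0, K, size=50)

fig, ax = plt.subplots()
ax.scatter(points[:, 0], points[:, 1], c=palette_colors(idx, K))
fig.savefig("clusters.png")
```

Because the palette is built from K at run time, nothing needs to change when the cost function picks a different number of centroids.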

score 2 · Answer 2 · edited May 23 '17 at 12:26

See the answers to the post "Scatter plot and Color mapping in Python". I assume your centroid indices correspond to clusters. In that case you can either pass a plain array as the colors:

ax.scatter(srcport, dstport, c=idx, marker='x')
ax.scatter(centroids[:,5], centroids[:,6], c=np.arange(K), marker='o', s=160)

or use a colormap:

# np.asarray lets the division work when idx is a plain Python list
ax.scatter(srcport, dstport, c=plt.cm.viridis(np.asarray(idx) / K), marker='x')
ax.scatter(centroids[:,5], centroids[:,6], c=plt.cm.viridis(np.arange(K) / K),
            marker='o', s=160)
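One caveat when passing `c=idx` and `c=np.arange(K)` in two separate `scatter` calls: each call scales its own values to the colormap independently, so if some cluster has no samples assigned, the sample colors and centroid colors can stop matching. The sketch below pins both calls to the same range with `vmin` and `vmax`. The function name `plot_clusters`, the default columns 5 and 6 (taken from the question), and the random test data are assumptions for illustration only.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to open a window
from matplotlib import pyplot as plt


def plot_clusters(X, idx, centroids, K, xcol=5, ycol=6):
    """Scatter two feature columns, colored by assigned centroid index.

    Hypothetical helper; a 2D variant of the plotPoints function in the question.
    """
    idx = np.asarray(idx, dtype=int)
    fig, ax = plt.subplots()
    # vmin/vmax pin the color scale to 0..K-1 for both calls, so cluster k
    # gets the same color for its samples and for its centroid marker.
    shared = dict(cmap="viridis", vmin=0, vmax=K - 1)
    ax.scatter(X[:, xcol], X[:, ycol], c=idx, marker="x", **shared)
    ax.scatter(centroids[:, xcol], centroids[:, ycol], c=np.arange(K),
               marker="o", s=160, edgecolors="black", **shared)
    ax.set_xlabel("Source port")
    ax.set_ylabel("Destination port")
    return fig, ax


# Made-up data: 50 samples with 10 features, K centroids, random assignments.
rng = np.random.default_rng(1)
K = 5
X = rng.random((50, 10)) * 10
centroids = rng.random((K, 10)) * 10
idx = rng.integers(0, K, size=50)

fig, ax = plot_clusters(X, idx, centroids, K)
fig.savefig("kmeans_clusters.png")
```

Passing integer labels with a `cmap` keeps the code independent of K, which addresses the concern about a static color dictionary.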

answered May 09 '17 at 18:46

Vadim Shkaberda

Thanks, that makes sense!. Let me try it out. – sunny May 10 '17 at 06:31
