
I have a dataset containing 61 rows (users) and 26 columns, on which I apply clustering with k-means and other algorithms. I first ran KMeans on the dataset after normalizing it and identified 10 clusters. In parallel, I also want to visualize these clusters, which is why I use PCA to reduce the number of features.

Here is a sample of the data, followed by the code I have written:

UserID  Communication_dur   Lifestyle_dur   Music & Audio_dur   Others_dur  Personnalisation_dur    Phone_and_SMS_dur   Photography_dur Productivity_dur    Social_Media_dur    System_tools_dur    ... Music & Audio_Freq  Others_Freq Personnalisation_Freq   Phone_and_SMS_Freq  Photography_Freq    Productivity_Freq   Social_Media_Freq   System_tools_Freq   Video players & Editors_Freq    Weather_Freq
1   63  219 9   10  99  42  36  30  76  20  ... 2   1   11  5   3   3   9   1   4   8
2   9   0   0   6   78  0   32  4   15  3   ... 0   2   4   0   2   1   2   1   0   0


import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

Sc = StandardScaler()
X = Sc.fit_transform(df)
pca = PCA(3) 
pca.fit(X) 
pca_data = pd.DataFrame(pca.transform(X)) 
print(pca_data.head())

This gives the following results:

   0  1  2
0  8 -4  5
1 -2 -2  1
2  1  1 -0
3  2 -1  1
4  3 -1 -3

I want to plot the clusters of my dataset using PCA and interpret the results. I am really new to this area, and any advice would be greatly appreciated!

Thanks in advance once again.

StupidWolf
ab20225

1 Answer


Using an example dataset:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA 
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

df, y = make_blobs(n_samples=70, centers=10, n_features=26, random_state=999, cluster_std=1)

Perform scaling, PCA and put the PC scores into a dataframe:

Sc = StandardScaler()
X = Sc.fit_transform(df)
pca = PCA(2) 
pca_data = pd.DataFrame(pca.fit_transform(X),columns=['PC1','PC2']) 
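
To help interpret the plot, it can be useful to check how much of the total variance the two components retain and which features weigh most on each one. The sketch below is not part of the original answer. It reuses the same synthetic data, and the `feature_*` names are placeholders made up for illustration (with a real data frame you would use `df.columns` instead).

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same synthetic data as in the answer; feature names are hypothetical placeholders
df, y = make_blobs(n_samples=70, centers=10, n_features=26, random_state=999, cluster_std=1)
feature_names = [f"feature_{i}" for i in range(df.shape[1])]

X = StandardScaler().fit_transform(df)
pca = PCA(2).fit(X)

# Fraction of the total variance captured by each component, and by both together
explained = pca.explained_variance_ratio_
print(explained, explained.sum())

# Component weights: one row per feature, one column per principal component
weights = pd.DataFrame(pca.components_.T, index=feature_names, columns=["PC1", "PC2"])

# Features with the largest absolute weight on PC1
top_pc1 = weights["PC1"].abs().sort_values(ascending=False).head(5)
print(top_pc1)
```

If the two components together explain only a small share of the variance, clusters that look overlapping in the 2D plot may still be well separated in the full 26-dimensional space.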

Run k-means, store the cluster labels in the data frame, and plot the result with seaborn:

kmeans = KMeans(n_clusters=10).fit(X)
pca_data['cluster'] = pd.Categorical(kmeans.labels_)
sns.scatterplot(x="PC1",y="PC2",hue="cluster",data=pca_data)


Or matplotlib:

fig,ax = plt.subplots()
scatter = ax.scatter(pca_data['PC1'], pca_data['PC2'],c=pca_data['cluster'],cmap='Set3',alpha=0.7)
legend1 = ax.legend(*scatter.legend_elements(),
                    loc="upper left", title="")
ax.add_artist(legend1)
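
As an optional extra that is not in the original answer, the cluster centroids can be drawn on the same axes. Because k-means was fitted on the scaled 26-dimensional data, the centroids have to be projected with the fitted PCA before plotting. This is a minimal sketch under a few assumptions: `random_state=0` and `n_init=10` are chosen only to make the run repeatable, and the `tab10` colormap is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same synthetic data and preprocessing as in the answer
df, y = make_blobs(n_samples=70, centers=10, n_features=26, random_state=999, cluster_std=1)
X = StandardScaler().fit_transform(df)

pca = PCA(2)
pca_data = pd.DataFrame(pca.fit_transform(X), columns=["PC1", "PC2"])

# random_state and n_init are assumptions, set only for repeatability
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
pca_data["cluster"] = kmeans.labels_

# Centroids live in the scaled feature space; project them onto PC1/PC2
centers_2d = pca.transform(kmeans.cluster_centers_)

fig, ax = plt.subplots()
ax.scatter(pca_data["PC1"], pca_data["PC2"], c=pca_data["cluster"], cmap="tab10", alpha=0.7)
ax.scatter(centers_2d[:, 0], centers_2d[:, 1], c="black", marker="x", s=80)

# Label each projected centroid with its cluster number
for label, (cx, cy) in enumerate(centers_2d):
    ax.annotate(str(label), (cx, cy), textcoords="offset points", xytext=(5, 5))

ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
```

Call `plt.show()` or `fig.savefig(...)` afterwards to see the figure when running this as a script.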


StupidWolf
  • This error was raised: TypeError: data type not understood – ab20225 Feb 15 '21 at 10:06
  • Which version of seaborn are you on? I am on '0.11.0'. OK, I added matplotlib code. – StupidWolf Feb 15 '21 at 10:17
  • Thank you for your answer! How do I deal with overlapping groups? – ab20225 Feb 15 '21 at 10:49
  • Hey, that's another question, and I cannot see your screen or your data to comment or help with that. Please post another question with reproducible data to get help. – StupidWolf Feb 15 '21 at 11:35
  • One way is to make your points more transparent (lower alpha) so you can see them. There are a lot of posts on SO, for example https://stackoverflow.com/questions/30108372/how-to-make-matplotlib-scatterplots-transparent-as-a-group – StupidWolf Feb 15 '21 at 11:36
  • If the issue is with the data, then it is something you need to work on. Again, no one can see your data and help troubleshoot; it's like driving blind. Please be fair to users who have spent time providing answers! – StupidWolf Feb 15 '21 at 11:37
  • I have also noticed that you have never accepted a single answer. Please see https://stackoverflow.com/help/someone-answers. SO is not a place for you to get other users to code for you!!! – StupidWolf Feb 15 '21 at 11:41
  • @StupidWolf Hi, may I kindly draw your attention to a similar [question](https://stackoverflow.com/questions/69077608/how-can-get-scatter-3d-plot-using-different-dataframes-to-set-ax-scatter-paramet). – Mario Sep 08 '21 at 11:05