
I'm doing a project analyzing page visits to an e-commerce website. It tracks continuous numerical, discrete numerical (integer-only counts), and categorical variables.

My understanding is that because KMeans computes means and distances over the feature values, it does not work very well with categorical variables. I also don't think it works well with discrete numerical values, because it will interpret them on a continuous scale (e.g. fractional cluster centers) when there shouldn't be fractions of these discrete values.
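As a minimal sketch of the categorical problem (the color labels are made up for illustration), integer-encoding a nominal variable imposes arbitrary distances that KMeans would then treat as meaningful:

from sklearn.preprocessing import LabelEncoder

# a hypothetical nominal variable with no inherent order
colors = ['red', 'green', 'blue']
codes = LabelEncoder().fit_transform(colors)
print(dict(zip(colors, codes)))  # {'red': 2, 'green': 1, 'blue': 0}
# to KMeans, 'red' is now twice as far from 'blue' as from 'green',
# a distance that is purely an artifact of the (alphabetical) encoding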

Here is how I run sklearn's KMeans: I score a range of candidate k values with the silhouette score and keep the k with the highest score. I create a dataframe called cluster_df containing only the numerical features from my original dataframe, and then a separate dataframe for each cluster:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# standardize the numerical features so no single feature dominates the distances
scaler = StandardScaler()
cluster_df[cluster_attribs] = scaler.fit_transform(cluster_df[cluster_attribs])

# score each candidate k with the silhouette score
k_rng = range(2, 10)
silhouette = []
for k in k_rng:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(cluster_df[cluster_attribs])
    silhouette.append(silhouette_score(cluster_df[cluster_attribs], kmeans.labels_))

# k=3 had the highest silhouette score, so refit with 3 clusters
kmeans = KMeans(n_clusters=3)
y_pred = kmeans.fit_predict(cluster_df[cluster_attribs])
cluster_df['cluster'] = y_pred
# invert the StandardScaler to return the features to their original scale
cluster_df[cluster_attribs] = scaler.inverse_transform(cluster_df[cluster_attribs])

cluster0 = cluster_df[cluster_df.cluster==0]
cluster1 = cluster_df[cluster_df.cluster==1]
cluster2 = cluster_df[cluster_df.cluster==2]
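(Side note: instead of hardcoding the winning k, it could be read off the silhouette list; a minimal sketch, assuming numpy is imported as np:)

import numpy as np

# pick the k whose silhouette score was highest in the loop above
best_k = k_rng[int(np.argmax(silhouette))]
kmeans = KMeans(n_clusters=best_k)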

I then perform data visualizations/analysis based on these 3 clusters. It seems to cluster the data pretty well, and even the categorical variables appear well separated across the clusters, despite not being included in the actual clustering.

For instance, Revenue is a binary column I didn't include in KMeans. But my 3 clusters seem to have separated my customers well into low-, medium-, and high-revenue groups (by the share of Revenue = 1 in each cluster) just from the numerical variables.
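Roughly how I check this, assuming the original dataframe is called df and its rows line up with cluster_df:

import pandas as pd

# share of Revenue = 0 vs Revenue = 1 within each cluster
print(pd.crosstab(cluster_df['cluster'], df['Revenue'], normalize='index'))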

My questions are:

1) Is it true that KMeans only works well with continuous numerical data, not discrete numerical or categorical data? (I've read there are ways to convert categorical variables to numerical, but for this project that seemed complicated and not reliably accurate given the nature of the data. I know about OneHotEncoder/LabelEncoder/MultiLabelBinarizer, but I mean a conversion that keeps the distances between categories in mind, which is more complicated.)

2) Is it an acceptable strategy to run KMeans on just your numerical data, separate it into clusters, and then draw insights about all of your variables (numerical, discrete numerical, categorical) by seeing how they've been separated across those clusters?

Greg Rosen
  • It might be worth mentioning: If a subset of your features can reliably predict the values for some of your other features, you may want to look into feature elimination, or dimensionality reduction techniques – G. Anderson Sep 11 '19 at 16:27
  • That's true, it does cluster my data in an understandable way as is, but I have no way of knowing whether it would be clustered much "better" had I used more features. So it may be that those features are all I need, or it may be that I'm missing out on much better clusters. There's no way to really know unless I can run it both ways and compare. – Greg Rosen Sep 11 '19 at 16:30

1 Answer


1)

  • I normally convert them using one-hot encoding and then divide the values by n, n being the number of unique values in that category; this normally works fine. In this case you will have n-1 more columns for each categorical column you already have (see the sketch after this list).
  • If you have ordinal values, then use LabelEncoder and divide them as I explained before. In this case you will keep the same number of columns.
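A minimal sketch of both ideas, using pandas get_dummies on hypothetical 'state' and 'size' columns (the data and names are made up):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'state': ['CA', 'NY', 'TX', 'CA']})

# one-hot encode, then scale each dummy column by n = number of unique categories
n = df['state'].nunique()  # here n = 3
dummies = pd.get_dummies(df['state'], prefix='state') / n
df = pd.concat([df.drop(columns='state'), dummies], axis=1)

# ordinal variant: LabelEncoder, then divide by the number of categories
# (LabelEncoder assigns codes alphabetically, so a manual mapping may be
# needed if the true ordinal order differs)
sizes = pd.Series(['small', 'medium', 'large'])
ordinal = LabelEncoder().fit_transform(sizes) / sizes.nunique()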

2)

  • If your dataset clusters fine without the categorical data, why not? But I would advise you to test more possibilities
Jose Macedo
  • Awesome, I'll try this. A couple of questions for you: 1) After OneHotEncoder (with drop='first' for the dummy variable trap), how do you know which columns belong to which category so you can divide them all? 2) Do you divide by n categories or by n-1, since we drop a dummy variable for each category? (Or is dropping not necessary for this?) 3) Is there an easy way to convert back from OneHotEncoder and LabelEncoder to analyze the data in its original form? – Greg Rosen Sep 11 '19 at 16:33
  • If you are using pandas get_dummies, the original variable will be in the column name, so you just need to filter on it. You divide by n, n being the number of categories in that variable. For example, if you have the states of the USA in a variable and you encode them, you will divide each new variable by 50. You need to drop the original variable. In sklearn you have the inverse_transform function. If you are using get_dummies, I think you can use something like https://stackoverflow.com/questions/50607740/reverse-a-get-dummies-encoding-in-pandas – Jose Macedo Sep 11 '19 at 18:10
  • You can try KModes as well for the categorical ones; it clusters on the mode instead of the mean, it's really nice: https://github.com/nicodv/kmodes – Jose Macedo Sep 11 '19 at 18:12
  • See k-prototypes as well; it's a combination of KModes and KMeans. If you already have your answer, please mark it as accepted :p – Jose Macedo Sep 11 '19 at 18:19
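A minimal sketch of the kmodes package linked above (KModes for purely categorical data, KPrototypes for mixed data; X_cat and X_mixed are hypothetical arrays):

from kmodes.kmodes import KModes
from kmodes.kprototypes import KPrototypes

# KModes clusters purely categorical data using matching dissimilarity and per-cluster modes
km = KModes(n_clusters=3, init='Huang', n_init=5)
labels_cat = km.fit_predict(X_cat)  # X_cat: hypothetical array of categorical values

# KPrototypes handles mixed data; pass the indices of the categorical columns
kp = KPrototypes(n_clusters=3, init='Huang', n_init=5)
labels_mixed = kp.fit_predict(X_mixed, categorical=[0, 2])  # hypothetical: columns 0 and 2 are categorical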