
I have a dataset with 28,000 records describing e-commerce store menu items. The challenge is the following:

Multiple stores have similar products but with different names. For example, 'HP laptop 1102' is present in different stores as 'HP laptop 1102', 'Hewlett-Packard laptop 1102', 'HP notebook 1102' and many other different names.

I have opted to convert the product names to TF-IDF vectors and use KMeans clustering to group similar products together. I am also using other features such as product category and sub-category (all categorical features are one-hot encoded).
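For reference, a minimal sketch of this setup with scikit-learn could look like the following. The sample rows, the `names` and `categories` variables, and `n_clusters=2` are placeholders chosen for illustration, not values from the real dataset.

```python
from scipy.sparse import hstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample rows; the real dataset has about 28,000 records.
names = [
    "HP laptop 1102",
    "Hewlett-Packard laptop 1102",
    "HP notebook 1102",
    "Dell monitor 2405",
    "Dell display 2405",
    "Dell LCD monitor 2405",
]
categories = ["laptops", "laptops", "laptops", "monitors", "monitors", "monitors"]

# TF-IDF vectors for the product names (sparse matrix, one row per product).
text_features = TfidfVectorizer().fit_transform(names)

# One-hot encode the categorical column; the encoder expects a 2-D input.
category_features = OneHotEncoder().fit_transform([[c] for c in categories])

# Stack both feature blocks side by side into one sparse feature matrix.
X = hstack([text_features, category_features]).tocsr()

# n_clusters=2 only suits this toy data; the real value is what needs tuning.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for name, label in zip(names, labels):
    print(label, name)
```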

Now my challenge is to estimate the optimal n_clusters for the KMeans algorithm. As the clustering should occur at product level, I'm assuming I need a high n_clusters value. Is there an upper limit for n_clusters?

Also any suggestions and advice on the solution approach would be really helpful. Thanks in advance.

Praneeth Vasarla
you are optimising for k, so you could try an approach similar to this one here: https://stackoverflow.com/questions/53075481/how-do-i-cluster-a-list-of-geographic-points-by-distance/53179675#53179675 – vencaslac Jan 20 '21 at 07:57

2 Answers


You are optimising for k, so you could try an approach similar to this one here: how do I cluster a list of geographic points by distance?

As for the maximum k, you can only ever have as many clusters as you have data points, so try using that as your upper bound.

marc_s
vencaslac

The upper limit is the number of data points, but you almost surely want a number a good bit lower for clustering to provide any value. If you have 10,000 products I would think 5,000 clusters would be a rough maximum from a usefulness standpoint.

You can use the silhouette score and inertia metrics to help determine the optimal number of clusters.

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of.... The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. - from the scikit-learn docs
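To make the formula concrete, here is a small hand-rolled sketch for one-dimensional points. The helper `silhouette_for_sample` and the sample data are hypothetical, and the sketch assumes every cluster has at least two members; in practice you would use scikit-learn's own implementation.

```python
def silhouette_for_sample(i, points, labels):
    """Silhouette coefficient (b - a) / max(a, b) for the 1-D point at index i."""
    own = labels[i]

    # a: mean distance to the other members of the sample's own cluster.
    same = [
        abs(points[i] - p)
        for j, p in enumerate(points)
        if labels[j] == own and j != i
    ]
    a = sum(same) / len(same)

    # b: smallest mean distance to the members of any other cluster.
    b = min(
        sum(abs(points[i] - p) for j, p in enumerate(points) if labels[j] == other)
        / labels.count(other)
        for other in set(labels)
        if other != own
    )
    return (b - a) / max(a, b)


points = [0.0, 1.0, 10.0, 11.0]

# Sensible grouping: a = 1.0, b = 10.5, so the score is 9.5 / 10.5.
good = silhouette_for_sample(0, points, [0, 0, 1, 1])

# Point 10.0 forced into the wrong cluster: a = 9.5, b = 1.0, score is negative.
bad = silhouette_for_sample(2, points, [0, 0, 0, 1])

print(round(good, 3), round(bad, 3))
```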

inertia_ is an attribute of a fitted clustering object in scikit-learn - not a separate evaluation metric. It is the "Sum of squared distances of samples to their closest cluster center." - see the KMeans clustering docs in scikit-learn, for example.

Note that inertia decreases as you add more clusters, so you may want to use an elbow plot to see where the decrease levels off.
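A minimal sketch of such a scan over candidate values of k is below. It uses synthetic data from `make_blobs` as a stand-in for the real feature matrix, and the range of k values is an arbitrary choice for illustration; it prints both metrics per k so the elbow can be read off or plotted.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real features (assumption: 4 generated groups).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

results = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_: sum of squared distances of samples to their closest center.
    results[k] = (model.inertia_, silhouette_score(X, model.labels_))

for k, (inertia, sil) in results.items():
    print(f"k={k}  inertia={inertia:.1f}  silhouette={sil:.3f}")

# Candidate k with the highest mean silhouette score.
best_k = max(results, key=lambda k: results[k][1])
print("best k by silhouette:", best_k)
```

With tens of thousands of records the silhouette computation gets expensive, since it relies on pairwise distances; `silhouette_score` accepts a `sample_size` argument to score a random subset instead.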

jeffhale