0

I want to cluster huge datasets but the bottleneck is the parameter tuning without visual checking.

Ex: K-means

I shouldn't try from 1 to N cluster if I have N samples, right? It's too brute force.

But what's range I should try? From 1 to N/4? or N/8? or the slope changing ratio?

In another words, how to determine the number of cluster without checking elbow point by my eyes?

Ex: DBSCAN

Follow here , choose the k-distance but is there a theory to help me decide the range of k?

Someone said using k-nn to help DBSCAN, but the k of k-nn is a complicated problem. How should I choose the range of k?

From above, I want to ask for help.

Anyone has the experience to cluster datasets and find a nearer (no need to very accurate) point to decide the parameter without visual checking?

code_worker
  • 384
  • 3
  • 16

1 Answers1

2

Use the Elbow Method :

You perform k means for k clusters, compute the sum of distance of each point and the center of the cluster it belongs. (inertia attribute in Sklearn). You plot it for different values of k, for some values of k you should see no improvements (that's the elbow).

Elbow Method

user1672455
  • 152
  • 6