
I'm confused about the difference between the following parameters in HDBSCAN:

  1. min_cluster_size
  2. min_samples
  3. cluster_selection_epsilon

Correct me if I'm wrong.

For min_samples, if it is set to 7, then clusters formed need to have 7 or more points. For cluster_selection_epsilon, if it is set to 0.5 meters, then any clusters that are more than 0.5 meters apart will not be merged into one, meaning that each cluster will only include points that are 0.5 meters apart or less.

How is that different from min_cluster_size?

desertnaut
HR1
  • The default is for `min_samples` to be set to `min_cluster_size`, but selecting a smaller `min_samples` can have a "dramatic effect on clustering" as shown in [the documentation](https://hdbscan.readthedocs.io/en/latest/parameter_selection.html). – rickhg12hs Jun 09 '21 at 17:49

1 Answer


They technically do two different things.

min_samples = the minimum number of neighbours to a core point. The higher this is, the more points are going to be discarded as noise/outliers. This is from the DBSCAN part of HDBSCAN.

min_cluster_size = the minimum size a final cluster can be. The higher this is, the bigger your clusters will be. This is from the H (hierarchical) part of HDBSCAN.

Increasing min_samples will increase the size of the clusters, but it does so by discarding data as outliers using DBSCAN.
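To make the min_samples effect concrete, here is a minimal sketch of the "core distance" idea: the distance from a point to its min_samples-th nearest neighbour. The helper `core_distance` and the toy points are made up for illustration, and I'm assuming the point itself is not counted as a neighbour (implementations may count it, which shifts things by one).

```python
import math

def core_distance(points, index, min_samples):
    """Distance from points[index] to its min_samples-th nearest neighbour.

    Assumption: the point itself is not counted as its own neighbour.
    """
    origin = points[index]
    distances = sorted(
        math.dist(origin, other)
        for j, other in enumerate(points)
        if j != index
    )
    return distances[min_samples - 1]

# A dense group of five points, plus an isolated pair far away.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),  # dense group
    (10.0, 0.0), (10.0, 1.0),                                    # isolated pair
]

for min_samples in (1, 3):
    dense = core_distance(points, 0, min_samples)
    pair = core_distance(points, 5, min_samples)
    print(f"min_samples={min_samples}: dense point {dense:.2f}, pair point {pair:.2f}")
```

With min_samples=1 the pair looks as dense as the big group, since each point has a neighbour 1 unit away. With min_samples=3 each pair point has to reach all the way to the big group (about 9 units) to find a third neighbour, so the pair now looks sparse and is a candidate for noise.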

Increasing min_cluster_size while keeping min_samples small, by comparison, keeps those outliers but instead merges any smaller clusters with their most similar neighbour until all clusters are above min_cluster_size.

So:

  1. If you want many highly specific clusters, use a small min_samples and a small min_cluster_size.
  2. If you want more generalized clusters but still want to keep most detail, use a small min_samples and a large min_cluster_size.
  3. If you want very general clusters and to discard a lot of noise in the clusters, use a large min_samples and a large min_cluster_size.

(Afaik there's little point in setting min_samples larger than min_cluster_size: it will still run, but the min_cluster_size step then has little left to do.)

user3252344
  • I cannot see your remark about min_samples and min_cluster_size inside the HDBSCAN documentation. Is there a reason why min_samples should not be larger than min_cluster_size? – abolfazl taghribi Jan 30 '23 at 22:16
  • @abolfazltaghribi it’s part of the HDBSCAN algorithm, which the docs probably don’t explain fully. The first pass does the density-based calc; the DBSCAN-style clusters use min_samples. The second pass does the hierarchical clusters: if a cluster's n_samples < min_cluster_size, merge it with the nearest neighbour cluster. If n_samples is always over min_cluster_size, the second pass does nothing, so you end up with DBSCAN instead of HDBSCAN. It will still run, it just isn’t using half the model. – user3252344 Feb 01 '23 at 05:21