They technically do two different things.
min_samples = the number of neighbours a point needs in its neighbourhood to count as a core point. The higher this is, the more points are going to be discarded as noise/outliers. This comes from the DBSCAN part of HDBSCAN.
min_cluster_size = the minimum size a final cluster can be. The higher this is, the bigger your clusters will be. This comes from the H (hierarchical) part of HDBSCAN.
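To make the min_samples definition above concrete, here is a minimal sketch of the "core distance" idea: the distance from each point to its min_samples-th nearest neighbour. It uses scikit-learn's NearestNeighbors rather than HDBSCAN's own internals. The helper name core_distances and the toy data are my own, and counting the point itself as a neighbour is an assumption (implementations differ on that detail).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def core_distances(X, min_samples):
    """Distance from each point to its min_samples-th nearest neighbour.

    Assumption: the point itself counts as its own first neighbour,
    which is what kneighbors() returns when queried with the fitted data.
    """
    nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
    distances, _ = nn.kneighbors(X)
    return distances[:, -1]  # last column = farthest of the k neighbours


# Toy data: one dense blob plus a few scattered points.
rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.3, size=(200, 2))
sparse = rng.uniform(low=-4.0, high=4.0, size=(20, 2))
X = np.vstack([dense, sparse])

for k in (2, 5, 15):
    d = core_distances(X, k)
    print(
        f"min_samples={k:>2}: "
        f"median core distance dense={np.median(d[:200]):.3f}, "
        f"scattered={np.median(d[200:]):.3f}"
    )
```

A larger min_samples can only push core distances up, never down, so isolated points look even more isolated and are more likely to end up as noise.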
Increasing min_samples will tend to leave you with larger clusters, but it does so by discarding more data as outliers, as in DBSCAN.
Increasing min_cluster_size while keeping min_samples small, by comparison, keeps most of those points: any group smaller than min_cluster_size stays merged with its most similar neighbouring cluster, so every final cluster has at least min_cluster_size points.
So:
- If you want many highly specific clusters, use a small min_samples and a small min_cluster_size.
- If you want more generalized clusters but still want to keep most detail, use a small min_samples and a large min_cluster_size.
- If you want very general clusters and to discard a lot of noise in the clusters, use a large min_samples and a large min_cluster_size.
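(If you leave min_samples unset, it defaults to min_cluster_size. Afaik nothing stops you from setting min_samples larger than min_cluster_size, but I haven't found a good reason to.)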