Questions tagged [hdbscan]

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors) and marks as outliers points that lie alone in low-density regions (points whose nearest neighbors are too far away). DBSCAN is one of the most commonly used clustering algorithms and among the most cited in the scientific literature.

In 2014, the algorithm received the Test of Time Award (an award given to algorithms that have received substantial attention in theory and practice) at the leading data mining conference, KDD.
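
The density-based grouping described above can be illustrated with a minimal pure-Python sketch of plain DBSCAN. This is not HDBSCAN and not the implementation used by any library; the function names `region_query` and `dbscan` are hypothetical, and `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a dense point) are assumed parameter names chosen for illustration. It uses a brute-force neighbor search, so it is only suitable for small inputs.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label for points in low-density regions


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Too few neighbors: provisionally noise. It may still be
            # claimed later as a border point of some cluster.
            labels[i] = NOISE
            continue
        # points[i] is a dense (core) point: start a new cluster from it.
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join, but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is dense too: keep growing
    return labels


# Two tight squares of four points each, plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two squares come out as clusters 0 and 1, and the isolated point is labeled -1 (noise) because it has no neighbors within `eps`.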

81 questions
14 votes, 4 answers

How do I solve "Failed building wheel for hdbscan"?

I tried to install hdbscan using pip install hdbscan, and I get this: ERROR: Failed building wheel for hdbscan ERROR: Could not build wheels for hdbscan which use PEP 517 and cannot be installed directly. I've tried several solutions, but none of them worked…
Omar Hossam • 311 • 1 • 2 • 9

8 votes, 1 answer

HDBSCAN difference between parameters

I'm confused about the difference between the following parameters in HDBSCAN min_cluster_size min_samples cluster_selection_epsilon Correct me if I'm wrong. For min_samples, if it is set to 7, then clusters formed need to have 7 or more…
7 votes, 1 answer

Issue with hdbscan (ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject)

I know a number of people have posted about this before but I still can't resolve my error. I'm trying to import hdbscan but it keeps returning the following…
code_learner93 • 571 • 5 • 12

6 votes, 3 answers

How to resolve ERROR: Could not build wheels for hdbscan, which is required to install pyproject.toml-based projects

I am trying to install bertopic and I got this error: pip install bertopic Collecting bertopic > Using cached bertopic-0.11.0-py2.py3-none-any.whl (76 kB) > Collecting hdbscan>=0.8.28 > Using cached…
SamanthaK • 67 • 1 • 1 • 5

5 votes, 1 answer

How do I use sklearn.metrics.pairwise pairwise_distances with callable metric?

I'm doing some behavior analysis where I track behaviors over time and then create n-grams of those behaviors. sample_n_gram_list = [['scratch', 'scratch', 'scratch', 'scratch', 'scratch'], ['scratch', 'scratch', 'scratch',…
not-bob • 815 • 1 • 8 • 23

4 votes, 1 answer

hdbscan error: TypeError: 'numpy.float64' object cannot be interpreted as an integer

I ran the hdbscan code on both Linux and Google Colab and got the same error: TypeError: 'numpy.float64' object cannot be interpreted as an integer. The error seems to happen when passing data to the 'fit_predict' function. The code comes from hdbscan…
Sotiris • 41 • 2

4 votes, 1 answer

Is DBSCAN or HDBSCAN the better option, and why?

Which clustering method is considered better, DBSCAN or HDBSCAN, and what is the reason?

4 votes, 1 answer

Problems with HDBSCAN and approximate predict

I would like to use the HDBSCAN clustering technique to predict outliers. I have trained my model to optimize the parameters, but then, when I apply approximate_predict on new data, I get different clusters and labels than I have in my original…

4 votes, 2 answers

What is the appropriate distance metric when clustering paragraph/doc2vec vectors?

My intent is to cluster document vectors from doc2vec using HDBSCAN. I want to find tiny clusters of semantic and textual duplicates. To do this I am using gensim to generate document vectors. The elements of the resulting docvecs are…
fluffet • 43 • 6

3 votes, 0 answers

Clustering with UMAP and HDBSCAN

I have a somewhat large amount of textual data, input by approximately 5000 people. I've assigned each person a vector using Doc2vec, reduced to two dimensions using UMAP and highlighted groups contained within using HDBSCAN. The intention is to…
Jacob • 53 • 5

3 votes, 0 answers

Problem with hdbscan used with bertopic: OSError: [Errno 22] Invalid argument

I am writing because I have a problem (silly and obvious introduction, I know). I am trying to use the BERTopic package using the Python interpreter in RStudio and the reticulate extension: Python 3.6.13…
Francis • 31 • 1

3 votes, 1 answer

HDBSCAN for R Crashed with large dataset

I tried to apply the HDBSCAN algorithm to my dataset (50,000 GPS points). However, every time I run the code, the R session crashes. Here is the basic info about my PC: processor: Intel i7 7820x 3.6 GHz; memory: 120 GB; system: 64-bit operating system,…
Yunzhe Liu • 93 • 5

2 votes, 1 answer

TypeError issue importing hdbscan

Python 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 17:59:51) [MSC v.1935 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> import hdbscan Traceback (most recent call last): File…
Nathan Luo • 41 • 2

2 votes, 1 answer

Scikit HDBSCAN *tree* labeling (not single-slice labeling)

BLUF: For a specific epsilon (or for HDBSCAN's 'favorite' epsilon), I can extract the mapping of my data in that epsilon's partition. But how can I see my data's full tree membership? I've gotten a ton out of the terrific tutorial here. In scikit…

2 votes, 1 answer

HDBSCAN handling of large datasets

I am trying to implement a clustering on a large dataset consisting of 146,000 observations, using the HDBSCAN algorithm. When I cluster these observations with the (default) Minkowski/Euclidean distance measure, clustering of the entire data goes…
statsguy96 • 37 • 7