I am trying to cluster a large dataset of 146,000 observations with the HDBSCAN algorithm. With the default (Minkowski/Euclidean) distance metric, clustering the entire dataset works fine and takes only about 8 seconds. However, I want to cluster with my own metric. That works on a subset of the data, although a lot slower, but on the full dataset I immediately get a memory error. This makes sense, since a pairwise distance matrix for a dataset of this size would take up around 150 GB.

What I don't understand is why there is no such issue with the default metric: looking at the HDBSCAN source code, it seems that in that case sklearn's pairwise_distances is also called, which would return the entire matrix. Additionally, I am wondering whether there is a workaround for my metric, or whether the only solution is to have access to roughly 150 GB of RAM.
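For reference, this is the back-of-the-envelope calculation behind the ~150 GB figure (assuming the matrix is stored as a dense float64 array, which is my assumption about what sklearn's pairwise_distances returns):

import numpy as np

n = 146_000
bytes_per_entry = 8  # float64 (assumption about the dtype of the returned matrix)
matrix_bytes = n * n * bytes_per_entry
print(f"Full pairwise distance matrix: {matrix_bytes / 1024**3:.0f} GiB")  # ~159 GiB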
The code for my metric and some results:
import hdbscan
import pandas as pd
import time
import numpy as np
from numpy.linalg import norm


def spex_distance(a, b):
    # Euclidean distance over the first two (spatial) dimensions,
    # scaled by the cosine similarity of the remaining expression vector
    euclidean = norm(a[:2] - b[:2])
    exp_vec_a, exp_vec_b = a[2:], b[2:]
    cos_sim = np.dot(exp_vec_a, exp_vec_b) / (norm(exp_vec_a) * norm(exp_vec_b))
    if cos_sim > 0:
        return euclidean / cos_sim
    else:
        return np.inf


def main():
    data = pd.read_pickle(file_location)  # file_location is the path to my pickled DataFrame
    small_data = data[:1000]

    t0 = time.time()
    hdb_custom = hdbscan.HDBSCAN(metric=spex_distance)
    hdb_custom.fit(small_data)
    print(f"Time needed for clustering subset with custom metric: {time.time()-t0}")  # 10 sec

    t0 = time.time()
    hdb_default = hdbscan.HDBSCAN()
    hdb_default.fit(small_data)
    print(f"Time needed for clustering subset with default metric: {time.time()-t0}")  # 0.03 sec

    t0 = time.time()
    hdb_default.fit(data)
    print(f"Time needed for clustering full dataset with default metric: {time.time()-t0}")  # 9 sec

    hdb_custom.fit(data)  # fails with memory error


if __name__ == "__main__":
    main()
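If my reading of the source is right, the default metric only avoids the full matrix because HDBSCAN can choose a tree-based algorithm for built-in metrics, while a Python callable falls back to the brute-force path. A check I was considering (this assumes that the algorithm='generic' option forces that brute-force path, which is how I understand the docs) would be:

import hdbscan
import pandas as pd

data = pd.read_pickle(file_location)  # same pickled DataFrame as above

# Assumption: algorithm='generic' forces the brute-force code path
# (full pairwise distance matrix) even with the default euclidean metric.
hdb_generic = hdbscan.HDBSCAN(algorithm='generic')
hdb_generic.fit(data)  # I would expect this to hit the same memory error on the full dataset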