Is there any way in sklearn to allow higher-dimensional clustering with the DBSCAN algorithm? In my case I want to cluster 3- and 4-dimensional data. I checked some of the source code and see that the DBSCAN class calls the check_array function from the sklearn utils package, which has an argument allow_nd. By default it is set to False, and there doesn't seem to be a way to set it through the DBSCAN class constructor. Any thoughts/ideas, or am I missing something simple? Thanks!

EDIT: Minimal code (I am using sklearn version 0.20.2).

import numpy as np
from sklearn.cluster import DBSCAN

data = np.random.rand(128, 416, 1)
db = DBSCAN()
db.fit_predict(data)

This is a sample, but the same error occurs with any real data that I load as well. Here is the exact error returned:

ValueError: Found array with dim 3. Estimator expected <= 2.

Here is the shape and ndim of the ndarray above.

(128, 416, 1)
3
  • There are no restrictions in `sklearn`'s `DBSCAN` on the number of dimensions out of the box. – Sergey Bushmanov Feb 23 '19 at 04:43
  • There is a hard check in the check_array method, controlled by allow_nd, which is set to False by default. When I try to pass an np.ndarray with more than 2 dimensions I receive an error specifically about the dimensions. – Travis Couture Feb 23 '19 at 06:15
  • What is the `.shape` of your data? Do you mean to cluster tensors? With what distance? Also: you have the source - you can remove the check and see if it then works for you... – Has QUIT--Anony-Mousse Feb 23 '19 at 07:46
  • I've tried this on both random numpy generated data and legitimate image data. I'll add the minimal code to the original post. – Travis Couture Feb 23 '19 at 15:26
  • Since your last dimensionality is 1 anyway, why can't you reshape this to `(128, 416)`? **What distance do you use** where reshaping is not equivalent? – Has QUIT--Anony-Mousse Feb 24 '19 at 22:35

1 Answer

DBSCAN indeed does not restrict the dimensionality of the data, i.e. the number of features per sample.

Proof:

from sklearn.cluster import DBSCAN
import numpy as np
np.random.seed(42)
X = np.random.randn(100).reshape((10,10))
clustering = DBSCAN(eps=3, min_samples=2).fit(X)
clustering.labels_
array([ 0,  0,  0, -1,  0, -1, -1, -1,  0,  0])

Your real problem is that you're trying to feed a 3-dimensional array of image data to an estimator that expects a 2d array of shape `(n_samples, n_features)`.
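To make the distinction between the number of features and the number of array dimensions concrete, here is a minimal sketch. The `eps` and `min_samples` values are arbitrary choices for illustration, and the random data is only a stand-in for real samples.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(42)

# 3 or 4 features per sample is fine: the input is still a 2-D array
# of shape (n_samples, n_features).
for n_features in (3, 4):
    X = rng.randn(200, n_features)
    labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
    print(X.ndim, X.shape, labels.shape)

# A 3-D array (as in the question) is rejected by input validation,
# regardless of how many values it holds.
rejected = False
try:
    DBSCAN().fit_predict(rng.rand(128, 416, 1))
except ValueError as err:
    rejected = True
    print(err)
```

So "3- or 4-dimensional data" in the clustering sense means 3 or 4 columns, not an array whose `ndim` is 3 or 4.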

In your situation you have several courses of action:

  1. Reshape your data to 2d, i.e. `(n_samples, n_features)` (check out this and this)
  2. Reopen your issue, properly defining the root of your problem and what you want.
  3. Try your luck with editing the source so that `check_array` is called with `allow_nd=True`
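For the first option, a rough sketch of two possible reshapes of the `(128, 416, 1)` array from the question follows. Which one is appropriate depends on what a "sample" is meant to be, which the question does not specify, so both interpretations below are assumptions, and the `eps`/`min_samples` values are arbitrary.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
data = rng.rand(128, 416, 1)  # same shape as in the question

# Interpretation A (assumption): each of the 128 rows is one sample
# described by 416 features. Flatten everything after the first axis.
X_rows = data.reshape(data.shape[0], -1)          # shape (128, 416)
labels_rows = DBSCAN().fit_predict(X_rows)        # one label per row

# Interpretation B (assumption): each pixel is one sample described by
# (row index, column index, intensity), i.e. 3 features per sample.
rows, cols = np.indices(data.shape[:2])
X_pixels = np.column_stack([rows.ravel(), cols.ravel(), data.ravel()])
labels_pixels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X_pixels)

# Map the per-pixel labels back onto the image grid.
labels_image = labels_pixels.reshape(data.shape[:2])
```

In interpretation B the spatial coordinates and the intensity are on different scales, so in practice the features would likely need rescaling before a single `eps` is meaningful.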
Sergey Bushmanov