
I have written code in Python to implement the DBSCAN clustering algorithm. My dataset consists of 14k users, with each user represented by 10 features. I am unable to decide what values to use for min_samples and epsilon as input. How should I decide that? The similarity measure is Euclidean distance, which makes it even harder to decide. Any pointers?

Maxwell
  • Evaluate the Euclidean distance on your data set. Does it work? What is a sensible similarity threshold? Then use this threshold as epsilon for DBSCAN. – Has QUIT--Anony-Mousse Apr 15 '12 at 18:57
  • How should I evaluate euclidean distance on my dataset? – Maxwell Apr 15 '12 at 20:23
  • @Anony-Mousse: I was thinking of this: would it make sense to normalize the Euclidean distances to 0-1? Right now the distances can go up to 10k+, which makes it difficult to decide on a threshold. But I am not sure how to normalize them. Any ideas? – Maxwell Apr 15 '12 at 21:58
  • 2
    You might want to read up on the curse of dimensionality, and use some entirely different distance function. Euclidean distance makes sense in the physical world, but not in arbitrary spaces. – Has QUIT--Anony-Mousse Apr 16 '12 at 05:25
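The advice in the comments (evaluate Euclidean distances on your data, then pick a sensible threshold as epsilon) is commonly operationalized as a k-distance plot: sort every point's distance to its k-th nearest neighbor and look for the "elbow". A minimal sketch, assuming scikit-learn is available and using random data in place of the actual 14k × 10 dataset; normalizing the features first (rather than the distances) is one way to address the 10k+ distance scale mentioned above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # stand-in for the real 14k x 10 data

# Scale each feature to zero mean / unit variance so no single feature
# dominates the Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)

# k-distance plot: distance from each point to its k-th nearest
# neighbor, where k is the intended min_samples. The elbow of the
# sorted curve is a candidate epsilon.
k = 10
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X_scaled)
dists, _ = nbrs.kneighbors(X_scaled)  # column 0 is the point itself
k_dists = np.sort(dists[:, k])       # k-th neighbor distance, ascending

# In practice: plt.plot(k_dists) and read epsilon off the elbow.
print(k_dists[:5])
```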

1 Answer


It is often hard to estimate DBSCAN's parameters.

Did you think about the OPTICS algorithm? In that case you only need min_samples, which corresponds to the minimal cluster size.
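As a sketch of that suggestion, scikit-learn ships an OPTICS implementation that takes min_samples but no fixed epsilon; the two-blob data below is synthetic and only illustrates the call:

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
# Two well-separated blobs in 10 dimensions, like the question's data shape.
blob1 = rng.normal(0, 0.3, size=(50, 10))
blob2 = rng.normal(5, 0.3, size=(50, 10))
X = np.vstack([blob1, blob2])

# OPTICS orders points by reachability distance instead of committing
# to a single epsilon up front; only min_samples is required.
labels = OPTICS(min_samples=10).fit_predict(X)
print(sorted(set(labels)))  # cluster ids; -1 marks noise, if any
```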

Otherwise, for DBSCAN I've done it in the past by trial and error: try some values and see what happens. A general rule to follow is that the noisier your dataset, the larger the values should be, and they are also correlated with the number of dimensions (10 in this case).
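The trial-and-error approach can be made systematic by sweeping candidate epsilons and watching how the cluster count and noise fraction respond. A sketch, again assuming scikit-learn and synthetic two-blob data rather than the real dataset:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob1 = rng.normal(0, 0.3, size=(50, 10))
blob2 = rng.normal(5, 0.3, size=(50, 10))
X = np.vstack([blob1, blob2])

# Sweep epsilon and inspect the results: too small fragments everything
# into noise, too large merges distinct clusters. A plateau where the
# cluster count is stable is a reasonable choice.
for eps in (0.5, 1.0, 2.0, 4.0):
    labels = DBSCAN(eps=eps, min_samples=10).fit_predict(X)
    n_clusters = len(set(labels) - {-1})
    n_noise = int((labels == -1).sum())
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```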

Charles Menguy