
I have run the clv package, which provides the S_Dbw and SD validity indexes for clustering, in R Commander. (http://cran.r-project.org/web/packages/clv/index.html)

I evaluated my clustering results from the DBSCAN, K-Means, and Kohonen algorithms with the S_Dbw index, but for all three algorithms S_Dbw is "Inf".

Is it "Infinite" meaning? Why did i confront with "Inf". Is there any problem in my clustering results?

In general, when does the S_Dbw index return "Inf"?
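To illustrate the kind of call I mean, here is a minimal sketch on a toy dataset (the function names clv.Scatt, clv.DensBw and clv.SDbw are my reading of the clv documentation, so please correct me if the API is different):

```r
# Minimal sketch: S_Dbw for a k-means result via the clv package.
# NOTE: clv.Scatt / clv.DensBw / clv.SDbw reflect my reading of the clv
# docs -- check ?clv.SDbw in your installation for the exact interface.
library(clv)

data(iris)
dat <- as.matrix(iris[, 1:4])

km    <- kmeans(dat, centers = 3)
clust <- as.integer(km$cluster)

scatt <- clv.Scatt(dat, clust)          # intra-cluster scatter and centers
dens  <- clv.DensBw(dat, clust, scatt)  # inter-cluster density term
clv.SDbw(scatt, dens)                   # the S_Dbw value
```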

BlueBit
  • Loosely speaking, `Inf` means infinite. You should post a reproducible example to help the community figure out why your specific code is producing `Inf`: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Drew Steen Nov 01 '12 at 13:08

2 Answers


Be careful when comparing different algorithms with such an index.

The reason is that such an index is pretty much an algorithm in itself: for each index, one particular clustering will necessarily come out as the "best". The main difference between an index and an actual clustering algorithm is that the index doesn't tell you how to find that "best" solution.

Some examples: k-means minimizes the distances from cluster members to cluster centers. Single-link hierarchical clustering finds the partition with the optimal minimum distance between partitions. DBSCAN finds the partitioning of the dataset in which all density-connected points are in the same partition. By that criterion, DBSCAN is optimal - if you use the appropriate measure.

Seriously: do not assume that because one algorithm scores higher than another on a particular measure, it therefore works better. All you find out this way is that a particular algorithm is more (cor-)related to a particular measure. Think of it as a kind of correlation between the measure and the algorithm, on a conceptual level.

Using a measure to compare different results of the same algorithm is a different matter: obviously, an algorithm cannot have an unfair advantage over itself. There can still be a similar effect with respect to parameters, though; for example, the in-cluster distances in k-means obviously go down when you increase k.
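You can see that effect with nothing but base R; here is a quick sketch using the total within-cluster sum of squares that kmeans reports:

```r
# Total within-cluster sum of squares almost always drops as k grows,
# so a "better" value here just means "more clusters", not "better clustering".
data(iris)
dat <- as.matrix(iris[, 1:4])

wss <- sapply(2:10, function(k) kmeans(dat, centers = k, nstart = 10)$tot.withinss)
names(wss) <- 2:10
round(wss, 1)   # essentially monotonically decreasing
```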

In fact, many of the measures are not even well-defined on DBSCAN results, because DBSCAN has the concept of noise points, which the indexes, as far as I know, do not.

Do not assume that the measure will give you an indication of what is "true" or "correct", and even less of what is useful or new. You should use cluster analysis not to find a mathematical optimum of a particular measure, but to learn something new and useful about your data - and that is probably not captured by some measure number.

Back to the indices: they are usually designed entirely around k-means. From a short look at S_Dbw, I have the impression that the moment one "cluster" consists of a single object (e.g. a noise object in DBSCAN), the value becomes infinite - in other words, undefined. It seems the authors of that index did not consider this corner case, but only used it on toy data sets where such situations did not arise. The R implementation cannot fix this without deviating from the original index and thereby turning it into yet another index.

Handling noise objects and singletons is far from trivial. I have not yet seen an index that doesn't fail in one way or another: typically, either a solution such as "all objects are noise" scores perfectly, or every clustering can trivially be improved by assigning each noise object to the nearest non-singleton cluster. If you want your algorithm to be able to say "this object doesn't belong to any cluster", then I do not know of an appropriate index.
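You can see the mechanism without the clv package at all. Indices of this family divide by a per-cluster spread term, and a singleton cluster has zero spread; the following is only a rough sketch of that effect, not the actual S_Dbw formula:

```r
# Rough illustration only -- not the actual S_Dbw formula.
# A singleton "cluster" has zero spread, and dividing by a zero spread
# term is exactly how a validity index ends up as Inf (or NaN).
x     <- c(1.0, 1.1, 0.9, 5.0)   # three nearby points plus one outlier
clust <- c(1, 1, 1, 2)           # the outlier forms a singleton cluster

# Per-cluster spread as the mean squared deviation from the cluster mean:
spread <- tapply(x, clust, function(v) mean((v - mean(v))^2))
spread        # cluster 2 has spread 0
1 / spread    # ... so any term divided by it becomes Inf
```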

Has QUIT--Anony-Mousse

The IEEE floating point standard defines Inf and -Inf as positive and negative infinity respectively. It means your result was too large to represent in the given number of bits.
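In R you can see this directly; for example:

```r
# Inf is a legal IEEE 754 double in R, produced by overflow or by division by zero.
.Machine$double.xmax        # largest representable finite double (~1.8e308)
.Machine$double.xmax * 2    # overflow -> Inf
1 / 0                       # Inf
-1 / 0                      # -Inf
is.infinite(1 / 0)          # TRUE -- handy for checking your index values
```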

Ari B. Friedman
  • Is it bad or good that my results are too large? For instance, with my dataset (13,000 records with 10 attributes), S_Dbw is "Inf" for cluster counts larger than 100 but returns a real value for fewer than 100 clusters. From my experimental results I suspect S_Dbw cannot handle large numbers of clusters, because it failed for every clustering with more than 100 clusters. – BlueBit Nov 01 '12 at 14:01
  • You're going to have to read the package documentation carefully, and if that fails, contact the authors (respectfully! they don't get paid for this). If you're running 32-bit R, try 64-bit R also, as that will give you larger floats before they top out. – Ari B. Friedman Nov 01 '12 at 14:06