
If the data to cluster are literally points (either 2D (x, y) or 3D (x, y, z)), choosing a clustering method is fairly intuitive. Because we can draw and visualize them, we have a much better sense of which clustering method is suitable.

e.g.1 If my 2D data set has the formation shown in the figure below, I know that K-means may not be a wise choice here, whereas DBSCAN seems like a better idea.

[figure: a 2D data set with cluster shapes that K-means would handle poorly]
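For concreteness, here is a minimal sketch of the contrast (the two-moons shape from scikit-learn is an assumption standing in for the actual formation in the figure):

```python
# Sketch: K-means vs. DBSCAN on a non-convex 2D data set (assumed
# two-moons shape; not the exact data from the figure).
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-means cuts the plane into convex (Voronoi) cells, so it splits
# each moon across clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN follows density-connected regions, so it recovers each moon.
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
```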

However, just as the scikit-learn website states:

While these examples give some intuition about the algorithms, this intuition might not apply to very high dimensional data.

AFAIK, in most practical problems we don't have such simple data. Most probably, we have high-dimensional tuples that cannot be visualized in this way.

e.g.2 I wish to cluster a data set where each data point is represented as a 4-D tuple <characteristic1, characteristic2, characteristic3, characteristic4>. I CANNOT visualize it in a coordinate system and observe its distribution as before, so I will NOT be able to say DBSCAN is superior to K-means in this case.

So my question:

How does one choose a suitable clustering method for such an "invisualizable" high-dimensional case?

– Sibbs Gambling

4 Answers


"High-dimensional" in clustering probably starts at some 10-20 dimensions in dense data, and 1000+ dimensions in sparse data (e.g. text).

4 dimensions are not much of a problem, and can still be visualized; for example by using multiple 2D projections (or even 3D, using rotation), or by using parallel coordinates. Here's a visualization of the 4-dimensional "iris" data set using a scatter plot matrix.
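As a minimal sketch of such a scatter plot matrix (using pandas and matplotlib; any plotting tool will do):

```python
# Sketch: scatter plot matrix of the 4-dimensional iris data set.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# One 2D projection per pair of features, histograms on the diagonal.
pd.plotting.scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.show()
```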

However, the first thing you still should do is spend a lot of time on preprocessing, and finding an appropriate distance function.
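As a hedged sketch of what that can look like (standardization and the Manhattan metric are placeholders here, not recommendations for your data):

```python
# Sketch: preprocessing plus an explicit choice of distance function.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

X = np.random.rand(100, 4)                   # stand-in for your data

X_scaled = StandardScaler().fit_transform(X)

# DBSCAN accepts any scipy/sklearn metric name; which one is
# appropriate depends entirely on what your features mean.
labels = DBSCAN(eps=0.5, min_samples=5, metric="manhattan").fit_predict(X_scaled)
```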

If you really need methods for high-dimensional data, have a look at subspace clustering and correlation clustering, e.g.

  • Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009): 1.

The authors of that survey also publish a software framework that includes many of these advanced clustering methods (not just k-means, but e.g. CASH, FourC, ERiC): ELKI

– Has QUIT--Anony-Mousse

There are at least two common, generic approaches:

  1. One can use some dimensionality reduction technique in order to actually visualize the high-dimensional data; there are dozens of popular solutions, including (but not limited to):

    • PCA - principal component analysis
    • SOM - self-organizing maps
    • Sammon's mapping
    • Autoencoder Neural Networks
    • KPCA - kernel principal component analysis
    • Isomap

    After this, one either goes back to the original space and uses a technique that seems reasonable based on observations in the reduced space, or performs the clustering in the reduced space itself. The first approach uses all available information, but can be invalid due to distortions induced by the reduction process, while the second one ensures that your observations and choice are valid (as you reduce your problem to a nice 2D/3D one) but loses a lot of information due to the transformation used; see the sketch after this list.

  2. One tries many different algorithms and chooses the one that scores best on some evaluation metric (many clustering evaluation metrics have been proposed); this too is illustrated in the sketch below. This is a computationally expensive approach, but it has a lower bias (as reducing the dimensionality introduces distortions that follow from the transformation used).
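A minimal sketch of both ideas combined (PCA, the candidate algorithms and the silhouette score are just one possible choice of reduction, algorithms and metric):

```python
# Sketch: (1) reduce to 2D for inspection, (2) try several algorithms
# and keep the one with the best internal evaluation score.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 4)                   # stand-in for your 4-D tuples

# Approach 1: visualize in a reduced space.
X_2d = PCA(n_components=2).fit_transform(X)  # plot X_2d to eyeball the structure

# Approach 2: compare candidate algorithms by an evaluation metric.
candidates = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
    "dbscan": DBSCAN(eps=0.3, min_samples=5),
}
for name, algo in candidates.items():
    labels = algo.fit_predict(X)
    if len(set(labels)) > 1:                 # silhouette needs >= 2 clusters
        print(name, silhouette_score(X, labels))
```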

– lejlot
  • 1. even if one reduces the dimension to a visualizable extent (say 3D), it still sometimes seems to me to make no sense to visualize the data. One example: say after reduction I have tuples . Does it make sense to cluster them in the 3D space? They possess different units. – Sibbs Gambling Sep 16 '13 at 08:42
  • By 2., do you mean the trial-and-error method? But my time is really limited; I may not be able to implement them all and try them one by one. – Sibbs Gambling Sep 16 '13 at 08:43
  • Clustering with "different units" is not a problem of dimensionality reduction but of clustering non-unimodal data in general. Dimensionality reduction actually tries to overcome this issue by finding the projections that best fit the data, no matter what they represent. In general, clustering is rather an **exploration technique** than the solution. To reasonably split the data you **need to know** how you want to do this, and with what exact aim - but then you can in many cases build a rule-based system rather than a clustering one – lejlot Sep 16 '13 at 09:00
  • In general, there are no easy solutions; this is not an easy problem. The two outlined methods can help you, but none will "do your job". These are just some "out of the box" solutions; if your time is limited, this is the only thing you can get. To do this "right" you would need a lot of time and effort for real data analysis, involving many visualizations, statistical methods and hard work, unfortunately – lejlot Sep 16 '13 at 09:02

It is true that high-dimensional data cannot be easily visualized in a Euclidean coordinate system, but it is not true that there are no visualization techniques for them.

In addition to this claim, I will add that with just 4 features (your dimensions) you can easily try the parallel coordinates visualization method. Or simply try a multivariate data analysis taking two features at a time (so 6 pairs in total) to try to figure out which relations hold between the two (correlation and dependency, generally). Or you can even use a 3D space for three features at a time.
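As a minimal sketch of the parallel-coordinates idea (iris stands in here for your 4-feature data):

```python
# Sketch: parallel coordinates for a 4-feature data set.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["class"] = iris.target

# Each observation becomes one polyline across the four axes;
# groups of similar lines hint at clusters.
pd.plotting.parallel_coordinates(df, class_column="class")
plt.show()
```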

Then, how does one get some info from these visualizations? Well, it is not as easy as in a Euclidean space, but the point is to spot visually whether the data clusters into groups (e.g. near some values on an axis in a parallel coordinates diagram) and to think about whether the data is somehow separable (e.g. whether it forms regions like circles or linearly separable ones in the scatter plots).

A little digression: the diagram you posted is not indicative of the power or capabilities of each algorithm given some particular data distributions; it simply highlights the nature of some algorithms. For instance, k-means is able to separate only convex and ellipsoidal areas (and keep in mind that convexity and ellipsoids exist in N dimensions as well). What I mean is that there is no rule that says: given the distributions depicted in this diagram, you have to choose the correct clustering algorithm accordingly.

I suggest using a data mining toolbox that lets you explore and visualize the data (and easily transform them, since you can change their topology with transformations, projections and reductions; check the other answer by lejlot for that), like Weka (plus you do not have to implement all the algorithms yourself).

In the end I will point you to this resource for different cluster goodness and fitness measures, so you can compare the results from different algorithms.
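As a sketch of what such a comparison can look like (these three internal indices from scikit-learn are just common examples):

```python
# Sketch: comparing clusterings with several internal quality indices.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

X = np.random.rand(200, 4)                   # stand-in for your data

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k,
          silhouette_score(X, labels),        # higher is better
          calinski_harabasz_score(X, labels), # higher is better
          davies_bouldin_score(X, labels))    # lower is better
```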

– rano
  • Weka doesn't have a lot of clustering. Better use ELKI. – Has QUIT--Anony-Mousse Sep 16 '13 at 11:35
  • it is not a problem of having 100 clustering algorithms, it's more a problem of having 100 different metrics to play with sometimes, or 100 transformations/reductions, and Weka, as well as ELKI, supports that : ) – rano Sep 16 '13 at 11:53

I would also suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are added to find the most relevant features. You can use these weights to increase performance and improve the BMU (best matching unit) calculation with Euclidean distance, for example.
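A minimal sketch of the feature-weighting idea (the weights below are made-up placeholders; in soft subspace clustering they are learned from the data):

```python
# Sketch: weighted Euclidean distance for a BMU (best matching unit)
# search, as used e.g. with SOM codebook vectors.
import numpy as np

def weighted_bmu(x, units, w):
    """Index of the unit closest to x under feature weights w."""
    d = np.sqrt((((units - x) ** 2) * w).sum(axis=1))
    return int(np.argmin(d))

units = np.random.rand(10, 4)           # e.g. SOM codebook vectors
w = np.array([0.4, 0.3, 0.2, 0.1])      # hypothetical relevance weights
print(weighted_bmu(np.random.rand(4), units, w))
```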