If the data to cluster are literally points, either 2D (x, y) or 3D (x, y, z), choosing a clustering method is fairly intuitive. Because we can plot the points and see their shape, we have a reasonable idea of which clustering method is more suitable.
e.g.1 If my 2D data set has the formation shown in the top right corner, I would know that K-means may not be a wise choice here, whereas DBSCAN seems like a better idea.
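To make e.g.1 concrete, here is a minimal sketch of the kind of comparison I have in mind. It assumes a two-moons shape as a stand-in for the formation in the picture; the `make_moons` data set, `eps=0.2`, `min_samples=5` and `n_clusters=2` are my own illustrative choices, not values taken from the scikit-learn page.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Assumed stand-in for the 2D formation: two interleaved half-circles.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-means partitions the plane around two centroids.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN grows clusters through dense neighbourhoods (eps chosen by eye).
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Agreement with the known moon membership (1.0 means identical partitions).
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```

In 2D I don't even need the scores: a scatter plot coloured by label makes the difference obvious, which is exactly the luxury I lose in higher dimensions.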
However, just as the scikit-learn website states:
While these examples give some intuition about the algorithms, this intuition might not apply to very high dimensional data.
AFAIK, in most practical problems we don't have such simple data. More likely, the data are high-dimensional tuples, which cannot be visualized this way.
e.g.2 I wish to cluster a data set where each data point is represented as a 4-D tuple <characteristic1, characteristic2, characteristic3, characteristic4>. I CANNOT plot it in a coordinate system and observe its distribution as before, so I will NOT be able to say that DBSCAN is superior to K-means in this case.
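A rough sketch of the situation in e.g.2, using synthetic 4-D data as a placeholder for my real tuples. Everything here is an assumption for illustration: the `make_blobs` data, `n_clusters=3`, `eps=1.0` and `min_samples=5` are arbitrary. The ground-truth labels are deliberately thrown away, since real data would not come with them.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Placeholder for <characteristic1, ..., characteristic4>; true labels discarded.
X4, _ = make_blobs(n_samples=300, n_features=4, centers=3, random_state=0)

# Both algorithms run on 4-D input without complaint.
kmeans_labels4 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X4)
dbscan_labels4 = DBSCAN(eps=1.0, min_samples=5).fit_predict(X4)

# How much the two partitions agree with each other (not with any truth).
agreement = adjusted_rand_score(kmeans_labels4, dbscan_labels4)

# DBSCAN marks points it considers noise with the label -1.
n_noise = int(np.sum(dbscan_labels4 == -1))

print(f"Agreement between the two labelings (ARI): {agreement:.2f}")
print(f"Points DBSCAN labelled as noise: {n_noise}")
```

Both calls return a label for every row, and I can measure how far the two labelings disagree, but without a picture I have no basis for deciding which one to trust.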
So my question:
How does one choose a suitable clustering method for such an "invisualizable" high-dimensional case?