Suitability of cluster analysis

Question

I have a large number of objects, for which I calculate 4 percentage differences between each pair.

For example: O1 and O2 have differences: a12, b12, c12 and d12 e.g. 51, 78, 22, 93.

I wish to flag 'close' objects that differ by less than some threshold. (I do not yet know how 'important' each of the 4 measures is.)

Is cluster analysis a suitable method to solve this? Any pointers to Python algorithms and beginners' tutorials would be very helpful.

This does not sound like cluster analysis. You already defined your objective: "connect" objects where similarity is below a threshold. But you are very vague, it's hard to answer. — Has QUIT--Anony-Mousse, Sep 26 '14 at 18:56

tom10 · Accepted Answer · 2014-09-26T16:06:25.830

This is probably not a cluster analysis problem.

The basic distinction is whether you are grouping all points by some fixed criterion of connectedness (this is not clustering), or whether the algorithm determines the criteria for the clusters dynamically from the data themselves (this is clustering).

The defining issue with typical cluster analysis is that the cluster definitions are based on the data only. That is, the algorithms create the clusters and the definitions for those clusters within the same process. Or, put another way, when you begin clustering, you give the algorithm the data, but you don't give it a threshold.

Since you have thresholds already, this isn't what's typically referred to as clustering. Even if you have multiple thresholds to choose between, just group the data as your thresholds will dictate and compare the groupings.
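As a rough illustration of grouping by a fixed threshold, here is a minimal sketch. It assumes the four percentage differences are stored in a symmetric array `diffs` of shape `(n, n, 4)`, and that a pair counts as 'close' only when all four measures are below the threshold. Both the storage layout and the "all four" rule are my assumptions, not something the question specifies, and `flag_close_pairs` is a hypothetical helper name.

```python
import numpy as np

def flag_close_pairs(diffs, threshold):
    """Return (i, j) pairs with i < j whose four differences are all below threshold."""
    diffs = np.asarray(diffs, dtype=float)
    # Assumed rule: a pair is 'close' only if every one of the 4 measures is small.
    close = np.all(diffs < threshold, axis=-1)
    # Keep the upper triangle only, so each pair is reported once and i != j.
    i, j = np.nonzero(np.triu(close, k=1))
    return list(zip(i.tolist(), j.tolist()))

# Made-up data for three objects; the first pair reuses the numbers from the question.
diffs = np.full((3, 3, 4), 100.0)
diffs[0, 1] = diffs[1, 0] = [51, 78, 22, 93]
diffs[0, 2] = diffs[2, 0] = [5, 8, 2, 9]
diffs[1, 2] = diffs[2, 1] = [60, 70, 80, 90]

pairs_10 = flag_close_pairs(diffs, 10)
pairs_95 = flag_close_pairs(diffs, 95)
print(pairs_10, pairs_95)
```

Running it with several thresholds and comparing the resulting pair lists is one way to see how sensitive the grouping is to the threshold you pick.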

The caveat here is exactly what you mean by "threshold" and how you want to apply it. If you want to find all points linked by a chain of neighbors, where each step in the chain is closer than some threshold, it's not clustering. If, instead, you want the threshold to define a non-linear metric between points, then the normal clustering algorithms would apply (though with a very unusual metric -- so this probably isn't the approach you want).
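If the "chain of connected points" reading is what you want, one way to sketch it is as connected components of a graph whose edges are the pairs below the threshold. This assumes you have already reduced the four measures to a single symmetric distance matrix `dist` (how to combine them is left open, and is an assumption here); `chain_groups` is a hypothetical helper name.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def chain_groups(dist, threshold):
    """Label objects so that any two linked by a chain of sub-threshold steps share a label."""
    dist = np.asarray(dist, dtype=float)
    adjacency = dist < threshold          # edge wherever a pair is 'close'
    np.fill_diagonal(adjacency, False)    # ignore each object's distance to itself
    n_groups, labels = connected_components(
        csr_matrix(adjacency.astype(int)), directed=False
    )
    return n_groups, labels

# Made-up distances: 0-1 and 1-2 are close, 0-2 is not, 3 is far from everything.
dist = np.array([
    [0.0,  5.0, 30.0, 90.0],
    [5.0,  0.0,  8.0, 90.0],
    [30.0, 8.0,  0.0, 90.0],
    [90.0, 90.0, 90.0, 0.0],
])
n_groups, labels = chain_groups(dist, threshold=10)
print(n_groups, labels)
```

Note that objects 0 and 2 end up in the same group even though they are not directly close to each other, because both are close to object 1. Whether that chaining behavior is desirable depends on your problem.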

And the other caveat is that by "cluster" people can mean different things, and I think I'm using the usual data analysis definition, though people use the word in other ways too, of course. See, for example, the algorithms in scipy.cluster.

As for what approach to take, so far you have not described enough details to answer that. For example, do you want to replace the closest pairs with their median, or follow connected chains of neighbors? Maybe something like a KDTree would be useful to you.
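For what it's worth, here is a minimal KDTree sketch. It only applies if each object can be described by its own numeric coordinates (the `points` array below is made up), rather than only by pairwise differences. It uses `scipy.spatial.cKDTree.query_pairs` to list every pair of points closer than a radius `r`.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical coordinates, one row per object.
points = np.array([
    [0.0, 0.0],
    [0.5, 0.0],
    [5.0, 5.0],
    [5.2, 5.1],
])

tree = cKDTree(points)
# query_pairs returns a set of (i, j) index pairs with i < j and distance <= r.
close_pairs = sorted(tree.query_pairs(r=1.0))
print(close_pairs)
```

The tree avoids comparing every pair explicitly, which matters when the number of objects is large.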

KDtree needs *numerical* data; not similarities. But I do agree that this does not sound like a cluster analysis problem. — Has QUIT--Anony-Mousse, Sep 26 '14 at 18:54
@Anony-Mousse: Of course, but it's not at all clear that the OP's data isn't in a sufficiently numerical form (or if it is clear, please actually explain). If you have a better answer just post it, but what's the point of creating straw men out of mine? — tom10, Sep 26 '14 at 19:10
Thanks Tom10, this has made it much clearer. I think I will try Python's CKDTree library like the example, [here](http://stackoverflow.com/questions/10818546/finding-index-of-nearest-point-in-numpy-arrays-of-x-and-y-coordinates) — schoon, Sep 28 '14 at 18:25
