
Imagine I have a dataset as follows:

[{"x":20, "y":50, "attributeA":90, "attributeB":3849},
 {"x":34, "y":20, "attributeA":86, "attributeB":5000},
 etc.

There could be many more attributes in addition to these; this is just an example. What I am wondering is: how can I cluster these points based on all of the factors, with control over the maximum separation allowed between two points for each variable before they are considered linked? (e.g. the Euclidean distance must be within 10 points, attributeA within 5 points, and attributeB within 1000 points)

Any ideas on how to do this in Python? As implied above, I would like to use Euclidean distance to compare the positions of two points if possible, rather than treating x and y as separate attributes. For the rest of the attributes it would all be one-dimensional comparison, if that makes sense.


Edit: Just to add some clarity in case this doesn't make sense: basically, I am looking for some algorithm to compare all objects with each other (or some more efficient approach). If all of object A's attributes and its Euclidean distance are within the specified thresholds when compared to object B, then those two are considered similar and linked. This procedure continues until all the linked clusters can be returned, since eventually some clusters will have no points similar to any point in another cluster, leaving the clusters separated. A rough brute-force sketch of what I mean is shown below.
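Something like this naive pairwise pass (the example thresholds are hard-coded just for illustration):

```python
import math

# Rough brute-force illustration; thresholds hard-coded from the example.
def linked(a, b):
    return (math.hypot(a["x"] - b["x"], a["y"] - b["y"]) <= 10
            and abs(a["attributeA"] - b["attributeA"]) <= 5
            and abs(a["attributeB"] - b["attributeB"]) <= 1000)

def clusters(points):
    # Incremental connected components: O(n^2) pairwise comparisons.
    groups = []  # each group is a list of indices into points
    for i, p in enumerate(points):
        # Find every existing group that p links to, then merge them.
        hits = [g for g in groups if any(linked(p, points[j]) for j in g)]
        for g in hits:
            groups.remove(g)
        groups.append([j for g in hits for j in g] + [i])
    return groups
```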

abagshaw

1 Answer


The simplest approach is to build a binary "connectivity" matrix.

Let a[i,j] be 0 exactly if your conditions are fulfilled, 1 otherwise.
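A minimal numpy sketch of building that matrix, assuming `data` is the list of dicts from the question and the example thresholds (10 / 5 / 1000):

```python
import numpy as np

# Pull each attribute into its own array; names follow the question's example.
x = np.array([p["x"] for p in data], dtype=float)
y = np.array([p["y"] for p in data], dtype=float)
attr_a = np.array([p["attributeA"] for p in data], dtype=float)
attr_b = np.array([p["attributeB"] for p in data], dtype=float)

# All pairwise checks vectorized, no Python loops.
euclid = np.hypot(x[:, None] - x[None, :], y[:, None] - y[None, :])
ok = (euclid <= 10)
ok &= np.abs(attr_a[:, None] - attr_a[None, :]) <= 5
ok &= np.abs(attr_b[:, None] - attr_b[None, :]) <= 1000

# a[i, j] is 0 where all conditions hold, 1 otherwise.
a = np.where(ok, 0.0, 1.0)
```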

Then run hierarchical agglomerative clustering with complete linkage on this matrix. If you don't need every pair of objects in every cluster to satisfy your thresholds, then you can also use other linkages.
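For instance, with scipy (reusing the 0/1 matrix `a` from the sketch above):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# squareform turns the square matrix into the condensed form linkage() expects.
Z = linkage(squareform(a, checks=False), method="complete")

# Cutting the dendrogram below 1 keeps only clusters whose members are
# all pairwise at distance 0, i.e. every pair meets all the thresholds.
labels = fcluster(Z, t=0.5, criterion="distance")
```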

This isn't the best solution: the distance matrix needs O(n²) memory and time, and the clustering even O(n³), but it is the easiest to implement. Computing the distance matrix in Python code will be really slow unless you can avoid all loops and have e.g. numpy do most of the work. To improve scalability, you should consider DBSCAN, and a data index.
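As a rough illustration, scikit-learn's DBSCAN accepts a precomputed distance matrix (though for real scalability gains you would pair DBSCAN with an index rather than a full matrix):

```python
from sklearn.cluster import DBSCAN

# `a` is the 0/1 matrix again. eps=0.5 means only pairs at distance 0
# (all thresholds met) count as neighbors; min_samples=2 links any two
# such points. Points linked to nothing get the label -1 (noise).
db = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit(a)
labels = db.labels_
```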

It's fairly straightforward to replace the three different thresholds with weights, so that you can get a continuous distance; likely even a metric. Then you could use data indexes, and try out OPTICS.
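One sketch of such a combined distance: divide each component by its threshold and take the maximum, so a value of at most 1 means all thresholds are met (this assumes scikit-learn's OPTICS with a callable metric, which is slow but illustrates the idea):

```python
import numpy as np
from sklearn.cluster import OPTICS

def combined(u, v):
    # Scale each condition by its threshold and take the maximum:
    # the result is <= 1 exactly when all three thresholds are met.
    d_xy = np.hypot(u[0] - v[0], u[1] - v[1]) / 10.0
    d_a = abs(u[2] - v[2]) / 5.0
    d_b = abs(u[3] - v[3]) / 1000.0
    return max(d_xy, d_a, d_b)

X = np.array([[p["x"], p["y"], p["attributeA"], p["attributeB"]]
              for p in data], dtype=float)

# max_eps=1 preserves the "all thresholds satisfied" notion of closeness.
opt = OPTICS(metric=combined, max_eps=1.0, min_samples=2).fit(X)
labels = opt.labels_
```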

Has QUIT--Anony-Mousse
  • Thanks, that's really helpful. The idea of using DBSCAN with a continuous distance from weights seems quite interesting. How would it work, though, if there is one attribute (or multiple attributes) that absolutely has to be equal (a string attribute that must be the same on both points for them to be considered connected)? How would that fit with the weights idea? I'd assume the simplest way would be to split my points into separate groups for each different string attribute... but if there are multiple attributes that have to be equal, it doesn't seem like that would work. – abagshaw Mar 27 '17 at 14:11
  • You can give infinite distance then, or use Generalized DBSCAN: Sander, Jörg, et al. "Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications." Data Mining and Knowledge Discovery 2.2 (1998): 169-194. But **splitting the data is a good idea because of efficiency**, and it does work with multiple must-be-identical attributes. – Has QUIT--Anony-Mousse Mar 27 '17 at 15:19