Finding best fit boxes of a scatter plot using python?

Question

I'm looking for the best python library to solve this problem:

I have a scatter plot with clumps of data points. This is just a series of x,y coordinate pairs.

I want a tool that will look at the data points I have, then suggest N 'boxes' that encompass the different groups.

Presumably I could go with higher or lower granularity by choosing how many boxes I wanted to use.

Are there any python libraries out there best suited to solve this type of problem?

When I read your question, the first thing that comes to my mind is "hierarchical clustering". The boxes can be deduced by the clusters they would represent. You can look at this question http://stackoverflow.com/questions/21638130/tutorial-for-scipy-cluster-hierarchy — Rerito, Jul 01 '14 at 20:53

score 1 · Answer 1 · edited May 23 '17 at 11:58

The way I understand your question, you want to find boxes that enclose clouds of data points. You define your granularity criterion as the number of boxes used to describe your data set.

I think what you are looking for is agglomerative hierarchical clustering. The algorithm is quite straightforward. Let n be the number of data points in the set. The algorithm starts with n groups, each containing a single point. It then iterates:

1. Merge the two closest groups according to a distance criterion.
2. Since the set of groups has changed, update the distances between the groups.
3. Go back to the merge step until you reach either a specific number of clusters or a specific distance threshold.
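As a minimal sketch of the loop above, assuming SciPy is available: `linkage` performs the successive merges and `fcluster` stops at a requested number of clusters. The synthetic data, the variable names and the choice of the "ward" criterion are my own assumptions for illustration, not something the question specifies.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic data: three well-separated clumps (made up for illustration).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Run the agglomerative merges; "ward" is one possible distance criterion.
Z = linkage(points, method="ward")

# Cut the merge history so that at most n_boxes clusters remain.
n_boxes = 3
labels = fcluster(Z, t=n_boxes, criterion="maxclust")
```

Here `labels[i]` is the cluster number of `points[i]`. To stop on a distance threshold instead of a cluster count, `fcluster` also accepts `criterion="distance"`.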

You can also build the dendrogram. It is a tree-like structure that stores the history of the whole merging process, allowing you to retrieve any level of granularity between 1 cluster and n clusters.
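To illustrate retrieving several granularities from one merge history, here is a small sketch, again assuming SciPy. The data and the "average" linkage method are arbitrary choices of mine. The linkage matrix is computed once and then cut at different cluster counts.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Synthetic data: four clumps of 20 points each (made up for illustration).
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 10.0], [20.0, 20.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Z has one row per merge, so n - 1 rows for n points.
Z = linkage(points, method="average")

# The same merge history can be cut at any granularity without re-clustering.
cuts = {k: fcluster(Z, t=k, criterion="maxclust") for k in (1, 2, 4)}

# no_plot=True returns the tree layout as a dict, so matplotlib is not needed.
tree = dendrogram(Z, no_plot=True)
```

Dropping `no_plot=True` draws the tree if matplotlib is installed, which can help in picking a sensible number of boxes by eye.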

SciPy has a set of functions dedicated to this algorithm, in the scipy.cluster.hierarchy module. It is covered by the question Tutorial for scipy.cluster.hierarchy.

Getting the clusters is the first step; now you can build your boxes. Let's look at this from a mathematical point of view. Let C be a cluster and P_1, ..., P_n the points of the cluster. If an axis-aligned rectangular box is fine, then it can be defined by the two points of coordinates (x_min, y_min) and (x_max, y_max), with:

x_min = min { P.x | P ∈ C }
y_min = min { P.y | P ∈ C }
x_max = max { P.x | P ∈ C }
y_max = max { P.y | P ∈ C }
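The min/max definitions translate almost directly into NumPy. This is only a sketch: `bounding_boxes` is a hypothetical helper name of mine, and it assumes you already have one cluster label per point (for example from a clustering step).

```python
import numpy as np

def bounding_boxes(points, labels):
    """Return {label: (x_min, y_min, x_max, y_max)} for each cluster label."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    boxes = {}
    for label in np.unique(labels):
        cluster = points[labels == label]   # all points P in cluster C
        x_min, y_min = cluster.min(axis=0)  # column-wise minimum
        x_max, y_max = cluster.max(axis=0)  # column-wise maximum
        boxes[int(label)] = (float(x_min), float(y_min),
                             float(x_max), float(y_max))
    return boxes

# Tiny hand-made example: two clusters labelled 1 and 2.
points = [(0, 0), (1, 2), (2, 1), (10, 10), (12, 11)]
labels = [1, 1, 1, 2, 2]
boxes = bounding_boxes(points, labels)
```

With this input, cluster 1 gets the box (0, 0, 2, 2) and cluster 2 gets (10, 10, 12, 11).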

EDIT :

This way of building the boxes is the dumbest possible. If you want something that really fits, you'll have to look into building the convex hull of each cluster.
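For the convex hull route, one option (my suggestion, assuming SciPy is installed) is `scipy.spatial.ConvexHull`. The sample cluster below is made up. Note that the hull needs at least three non-collinear points, so very small clusters would need the rectangular box as a fallback.

```python
import numpy as np
from scipy.spatial import ConvexHull

# One cluster: four corner points plus two interior points (made-up data).
cluster = np.array([
    [0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0],
    [2.0, 1.5], [1.0, 1.0],
])

hull = ConvexHull(cluster)

# hull.vertices holds the indices of the points on the hull boundary;
# for 2-D input they are given in counterclockwise order.
outline = cluster[hull.vertices]

# For 2-D input, hull.volume is the enclosed area.
area = hull.volume
```

`outline` can be passed to a polygon-drawing routine to plot the tight outline around the cluster instead of a rectangle.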
