
I have a number of scattered data sets in Nx3 matrices; a simple example plotted with scatter3 is shown below (pastebin of the raw values):

[Image: Scattered data with obvious regions]

Each of my data sets has an arbitrary number of regions/blobs; the example above, for instance, has 4.

Does anyone know of a simple method to programmatically find the number of regions in this form of data?

My initial idea was to use a delaunayTriangulation/convexHull approach, but without any treatment of the data this will only find the outer volume of the entire plot rather than each region.

The next idea would involve gathering nearest-neighbour statistics for each point, asking whether it's within a grid-size distance of another point, and then lumping those that are into separate blobs/clusters.

Is there a higher-level MATLAB function I'm not aware of that could assist me here, or does anyone have a better suggestion of how to pull the region count out of data like this?

Geodesic

4 Answers


It sounds like you need a clustering algorithm. Fortunately for you, MATLAB provides a number of these out of the box. There are plenty of algorithms to choose from, and it sounds like you need something where the number of clusters is unknown beforehand, correct?

If this is the case, and your data is as "nice" as your example, I would suggest kmeans combined with a technique to properly choose "k", as suggested here.
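
For illustration, here's a minimal sketch of that idea, using the mean silhouette value as the "choose k" technique (requires the Statistics Toolbox; the demo data, the kMax bound, and the variable names are my own placeholders, not from the question):

% Sketch: run kmeans for a range of k and keep the k with the best
% mean silhouette value (silhouette is undefined for k = 1).
rng(0);                                    % reproducible demo data
X = [randn(100,3); randn(100,3) + 5];      % two well-separated blobs
kMax = 8;                                  % assumed upper bound on regions
meanSil = -inf(kMax,1);
for k = 2:kMax
    idx = kmeans(X, k, 'Replicates', 5);   % restarts avoid poor local minima
    meanSil(k) = mean(silhouette(X, idx)); % partition quality for this k
end
[~, bestK] = max(meanSil);                 % bestK comes out as 2 here
idx = kmeans(X, bestK, 'Replicates', 5);   % final cluster labels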

There are other options, of course; I recommend you learn more about the clustering options in MATLAB. Here's a nice reference for further reading.

MarkV
  • Thanks! Good suggestions. I'll take a look and see how this goes. Will report back soon. – Geodesic Jun 16 '13 at 04:55
  • kmeans caused a few problems, but your links helped me out greatly. I have a working solution using hierarchical clustering with distance as the criterion, following your third link. Many thanks. – Geodesic Jun 16 '13 at 08:20

Determining the number of distinct clusters in a dataset is a tricky problem, and probably harder than it might seem at first sight. In fact, algorithms like k-means are heavily dependent on it. Wikipedia has a nice article on it, but no clear and easy method.

The elbow method mentioned there seems comparatively easy to implement, although it might be computationally costly. In essence, you try different numbers of clusters and choose the number at which the explained variance stops growing much and plateaus.
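
As a rough sketch of the elbow idea (the demo data, the kMax bound, and reading the elbow off a plot are my own assumptions):

% Sketch of the elbow method: explained variance versus k.
rng(1);
X = [randn(80,3); randn(80,3) + 4; randn(80,3) - 4]; % three demo blobs
kMax = 10;
totalSS = sum(sum(bsxfun(@minus, X, mean(X)).^2));   % total sum of squares
explained = zeros(kMax,1);
for k = 1:kMax
    [~, ~, sumd] = kmeans(X, k, 'Replicates', 5);    % within-cluster sums of squares
    explained(k) = 1 - sum(sumd)/totalSS;            % fraction of variance explained
end
plot(1:kMax, explained, '-o');                       % the "elbow" marks a good k
xlabel('number of clusters k'); ylabel('variance explained');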

Also, the notion of a cluster needs to be clearly defined: what if zooming into any of the blobs displays a similar structure to the corner structure in the picture?

zw324
  • Thanks, I'll have a read through that. In answer to your second question: that issue would be limited via the grid size of my data, so there wouldn't be any visible substructure under those bounds. – Geodesic Jun 16 '13 at 04:49
  • Actually, I misread what you meant concerning your question. Valid point, although I understand my data enough to know that won't be an issue in this particular case; but certainly something to consider in general. – Geodesic Jun 16 '13 at 04:58

I would suggest implementing a "light" version of the Gaussian mixture model: have each point "vote" for a cube. In the example above, all points centered around (-1.5,-1.5,0) would each add +1 to the cube [-2,-1]x[-2,-1]x[-0.2,0.2]. Finally, you can analyze the peaks in the voting array.
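
A rough sketch of that voting scheme, for illustration (the bin width, the vote threshold, and counting peak regions with bwconncomp are my own assumptions, not part of the answer):

% Sketch: bin each point into a cube, accumulate votes, count peak regions.
rng(2);
P = [randn(200,3)*0.2; bsxfun(@plus, randn(200,3)*0.2, [3 3 0])]; % demo data
binWidth = 0.5;                                           % assumed cube size
sub = floor(bsxfun(@minus, P, min(P,[],1))/binWidth) + 1; % cube subscripts
votes = accumarray(sub, 1);                               % 3-D voting array
peaks = votes >= 5;                                       % assumed vote threshold
cc = bwconncomp(peaks);                                   % Image Processing Toolbox
numRegions = cc.NumObjects;                               % expect 2 for this demo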

Leeor

In the interest of completeness: there's a vastly simpler answer to this problem (which I have built) than hierarchical clustering, and it gives much better results and can differentiate between 1 cluster and 2 (an issue that I couldn't manage to fix with MarkV's suggestions). This assumes your data is on a regular grid of known size, and that you have an unknown number of clusters separated by at least 2*(grid size):

% Idea is as follows: 
% * We have a known grid size, dx. 
% * A random point [may as well be minima(1,:)] will be in a cluster of 
%   values if any others in the list lie dx away (with one dimension
%   varied), sqrt(2)*dx (two dimensions varied) or sqrt(3)*dx (three 
%   dimensions varied).
% * Chain these objects together until all are found, any with distances
%   beyond sqrt(3)*dx of the cluster are ignored for now.
% * Set this cluster aside, repeat until no minima data is left.
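% * Requires knnsearch (Statistics Toolbox) and ismemberf, a floating-point
%   tolerant ismember from the MATLAB File Exchange.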

function [blobs, clusterIdx] = findClusters(minima,dx)
%problem setup
dx2 = sqrt(2)*dx;
dx3 = sqrt(3)*dx;
eqf = @(list,dx,dx2,dx3)(abs(list-dx) < 0.001 | abs(list-dx2) < 0.001 | abs(list-dx3) < 0.001);
notDoneClust = true;
notDoneMinima = true;

clusterIdx = zeros(size(minima,1),1);
point = minima(1,:);
list = minima(2:end,:);
blobs = 0;

while notDoneMinima
    cluster = nan(1,3);

    while notDoneClust
        [~, dist] = knnsearch(point,list); %distance from each remaining point to its nearest frontier point
        nnidx = eqf(dist,dx,dx2,dx3); %logical index of list rows that are grid neighbours of the frontier

        cluster = cat(1,cluster,point,list(nnidx,:)); %add points to current cluster
        point = list(nnidx,:); %points to check are now all values that are nn to initial point
        list = list(~nnidx,:); %list is now all other values that are not in that list

        notDoneClust = ~isempty(point); %if there are no more points to check, all values of the cluster have been found
    end
    blobs = blobs+1;
    clusterIdx(ismemberf(minima,cluster(2:end,:),'rows')) = blobs;

    %reset points and list for a new cluster
    if ~isempty(list)
        if size(list,1)>1 %length(list) would be 3 (the column count) for a single row
            point = list(1,:);
            list = list(2:end,:);
            notDoneClust = true;
        else
            %the single remaining row is a cluster of its own: label it and
            %exit (point is empty here after the inner loop, so use list).
            blobs = blobs+1;
            clusterIdx(ismemberf(minima,list,'rows')) = blobs;
            notDoneMinima = false;
        end
    else
        notDoneMinima = false;
    end
end
end
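
For illustration, a hypothetical way to exercise the function on synthetic gridded data (the spacing and blob offset are made up):

% Two gridded blobs separated by much more than 2*dx.
dx = 0.1;
[gx, gy, gz] = ndgrid(0:dx:0.3);
blobA = [gx(:), gy(:), gz(:)];
blobB = bsxfun(@plus, blobA, [1 1 1]);          % second, well-separated blob
minima = [blobA; blobB];
[blobs, clusterIdx] = findClusters(minima, dx); % blobs should be 2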

I fully understand this method is useless for clustering data in the general sense, as any outlying data will be marked as a separate cluster. If that happens, though, it's what I need anyway, so this may just be an edge-case scenario.

Geodesic