
I want to cluster a huge number of data records. The data I am dealing with are strings, and the clustering process takes a long time.
Assume I want to cluster a set of email records so that emails written by the same person end up in the same cluster (taking into account that a person might write his/her name in different ways).
I want to perform a multi-stage clustering:

  • First stage: cluster on the name attribute. If the name distance between two records is below a threshold, the records are placed in the same cluster; otherwise...
  • The data records enter a second stage of clustering based on the remaining attributes (other than name).

The pairwise distances are already calculated, and now I am in the clustering phase. I want to use the following code for DBSCAN clustering:

function [IDX, isnoise] = dbscan_strings(X, epsilon, MinPts)
% DBSCAN clustering over string records.
%   X       - n-by-1 cell array of strings (data records)
%   epsilon - neighbourhood radius (maximum distance to count as a neighbour)
%   MinPts  - minimum number of neighbours required for a core point
%   IDX     - cluster label per record (0 = noise/unassigned)
%   isnoise - logical noise flag per record
C = 0;                               % current cluster label
n = size(X,1);
IDX = zeros(n,1);
D = pdist2(X, X, @intersection);     % pairwise distances (custom string metric)
visited = false(n,1);
isnoise = false(n,1);
for i = 1:n
    if ~visited(i)
        visited(i) = true;
        Neighbors = RegionQuery(i);
        if numel(Neighbors) < MinPts
            isnoise(i) = true;       % X(i,:) is noise (may still be claimed later)
        else
            C = C + 1;
            ExpandCluster(i, Neighbors, C);
        end
    end
end

function ExpandCluster(i, Neighbors, C)
    % Grow cluster C from seed point i by absorbing density-reachable points.
    IDX(i) = C;
    k = 1;
    while true
        j = Neighbors(k);
        if ~visited(j)
            visited(j) = true;
            Neighbors2 = RegionQuery(j);
            if numel(Neighbors2) >= MinPts
                Neighbors = [Neighbors Neighbors2];   %#ok<AGROW>
            end
        end
        if IDX(j) == 0
            IDX(j) = C;              % border or core point joins cluster C
        end
        k = k + 1;
        if k > numel(Neighbors)
            break;
        end
    end
end

function Neighbors = RegionQuery(i)
    % Indices of all points within epsilon of point i.
    Neighbors = find(D(i,:) <= epsilon);
end
end
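For reference, a minimal call might look like this (a sketch only: the example data and parameter values are made up, and the custom `intersection` string-distance function is assumed to exist on the path):

```matlab
% Hypothetical usage sketch for dbscan_strings.
X = {'john smith'; 'jon smith'; 'j. smith'; 'alice jones'};  % example names
epsilon = 0.4;   % maximum string distance to count as a neighbour (placeholder)
MinPts  = 2;     % minimum neighbours to form a core point (placeholder)
[IDX, isnoise] = dbscan_strings(X, epsilon, MinPts);
% IDX(i) is the cluster label of record i; 0 marks noise.
```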

I need help in turning this clustering process into a multi-stage process, where X contains data records with all attributes. Assume the name attribute is in the first column, so X{:,1} holds the names.
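One possible way to wire the two stages together is a wrapper like the one below. This is only a sketch of what I have in mind, not working code: it assumes `dbscan_strings` can be called per attribute subset, and `epsilonName`, `epsilonAttr`, and `MinPts` are placeholder parameters.

```matlab
% Sketch of a two-stage clustering wrapper (untested).
% Stage 1: cluster on the name attribute only (first column of X).
names = X(:,1);                          % X assumed to be a cell array
[IDX1, ~] = dbscan_strings(names, epsilonName, MinPts);

% Stage 2: within each name cluster, re-cluster on the other attributes.
IDX = zeros(size(X,1), 1);
C = 0;                                   % global label offset
for c = 1:max(IDX1)
    members = find(IDX1 == c);
    [subIDX, ~] = dbscan_strings(X(members, 2:end), epsilonAttr, MinPts);
    subIDX(subIDX > 0) = subIDX(subIDX > 0) + C;   % make labels globally unique
    IDX(members) = subIDX;               % noise within a subgroup stays 0
    C = max(IDX);
end
```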

NOTE: I will give a bounty of 50 points for the one who helps me.

    StackOverflow is not really a code-writing service, although you may find people willing to help (especially with a bounty). In order to make this a good programming question, you should demonstrate your own effort and point out where the problem is that you can't solve. This could be some conceptual ideas for which you lack the programming skills, or an implementation that you tried but failed for a reason that you can't solve. – tvo Sep 09 '16 at 14:35
  • A second piece of advice. I did not dive into trying to understand what you already have, because it lacks documentation and comments. You have a number of functions with a bunch of variables in that piece of code, and I can't tell what they mean on first sight. Try to add some overall clues to what your code is doing and what the variables mean. This will help for several reasons: you will understand your code better now, and also recall how it works in the future; others (programmers or users) will understand it better and could contribute more easily. – tvo Sep 09 '16 at 14:44
  • I'm not asking for code writing; my code is already written. I'm asking for help in general. I'm asking for ideas: which lines of my code should I modify? Something like that. – s.e Sep 09 '16 at 17:30

1 Answer


Don't do everything at once!

You are computing a lot of things that you never need, which makes things slow. For example, a good DBSCAN implementation does not use a full distance matrix, but an index.

For the names, work on unique names only! You presumably have many exactly identical names, yet you end up computing the same distances again and again.

So first of all, build a set of unique names. Perform your similarity matching on that set (I would however suggest using OpenRefine for this rather than Matlab!). Once you have identified names to merge, build a new data matrix for every name group, then run whatever clustering you want on each group. Good candidates are HDBSCAN and OPTICSXi (have a look at the clustering algorithms available in ELKI, which probably has the widest selection to choose from). Maybe start with a single name group of average size, to get a feeling for the algorithm's parameters. Don't cluster all subsets at once.
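The deduplication step alone can be sketched as follows (assumed MATLAB, mirroring the question's use of `pdist2` with a custom string metric; `intersection` and the grouping step are assumptions carried over from the question):

```matlab
% Compute name distances on unique names only, then map back to all records.
[uNames, ~, backIdx] = unique(X(:,1));       % uNames: distinct names, sorted
D = pdist2(uNames, uNames, @intersection);   % far fewer pairs than on all records
% ... match/merge uNames here (or export uNames to OpenRefine instead) ...
% Any grouping computed on uNames propagates to the full record set via backIdx:
%   recordGroup = nameGroup(backIdx);
```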

Has QUIT--Anony-Mousse