I want to cluster a huge amount of data records. The data that I'm dealing with are of string type. The clustering process takes a long time.
Let us assume that I want to cluster a set of email data records into cluster, where emails written by the same person are allocated to the same cluster (taking into account that a person might write his/her name in different ways).
I want to perform a multi stage clustering:
- First stage clustering based on name, if the name distance between two records is less than a threshold we consider these clusters otherwise...
- The data records enters the second stage of clustering based on other attributes (other than name).
The pairwise distance is calculated. Now I'm in the clustering phase. I want to use the following code for dbscan
clustering:
function [IDX, isnoise] = dbscan_strings(X,epsilon,MinPts)
C = 0;
n = size(X,1);
IDX = zeros(n,1);
D = pdist2(X,X,@intersection);
visited = false(n,1);
isnoise = false(n,1);
for i = 1:n
if ~visited(i)
visited(i) = true;
Neighbors = RegionQuery(i);
if numel(Neighbors)<MinPts
% X(i,:) is NOISE
isnoise(i) = true;
else
C = C+1;
ExpandCluster(i,Neighbors,C);
end
end
end
function ExpandCluster(i,Neighbors,C)
IDX(i) = C;
k = 1;
while true
j = Neighbors(k);
if ~visited(j)
visited(j) = true;
Neighbors2 = RegionQuery(j);
if numel(Neighbors2)>=MinPts
Neighbors = [Neighbors Neighbors2]; %#ok
end
end
if IDX(j)==0
IDX(j) = C;
end
k = k + 1;
if k > numel(Neighbors)
break;
end
end
end
function Neighbors = RegionQuery(i)
Neighbors = find(D(i,:)<=epsilon);
end
end
I need help in making the following clustering process into a multistage process where X
contains data records with all attributes. Let us assume that X{:,1}
is the data records with the name attribute, since the name is contained in the first column.
NOTE: I will give a bounty of 50 points for the one who helps me.