Based on the previous question from here, I have another question: what about classification accuracy?
I'll address your last point first to get it out of the way. If you don't know what the classification labels were to begin with, then there's no way to assess classification accuracy. How do you know whether the correct label was assigned to a point in C or D if you don't know what label it had to begin with? In that case, we'll have to leave that alone.
However, what you could do is calculate the percentage of values that get classified as A or B in the matrices C and D to get a sense of the distribution of samples in both. Specifically, if for example in matrix C the majority of samples get classified as belonging to the group defined by matrix A, then that is probably a good indication that C is very much like A in distribution (there's a short sketch of this calculation after the labelling code below).
In any case, one thing I can suggest for classifying which points in C or D belong to either A or B is to use the k-nearest neighbours algorithm. Concretely, you have a bunch of source data points, namely those that belong in matrices A and B, where A and B have their own labels. In your case, samples in A are assigned a label of 1 and samples in B are assigned a label of -1. To determine which group an unknown point belongs to, you simply find the distance in feature space between this point and all points in A and B. Whichever point in A or B is closest to the unknown point, whatever group that point belonged to in your source points is the group you would assign to this unknown point.
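To make that rule concrete, here's a minimal hand-rolled sketch of the 1-nearest-neighbour decision for a single unknown point. The variable query is hypothetical (a 1 x 1000 row vector); sourcePoints and labels are the same variables built in the snippet further down:
%// Hand-rolled 1-NN for one unknown point (query is a 1 x 1000 row vector)
dists = sqrt(sum(bsxfun(@minus, sourcePoints, query).^2, 2)); %// Euclidean distance to every source point
[~, ind] = min(dists);    %// Row of the closest source point
queryLabel = labels(ind); %// Inherit that source point's label (1 or -1)
knnsearch does exactly this for you, and more efficiently, which is why we'll use it below.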
As such, simply concatenate C and D into a single N x 1000 matrix, apply k-nearest neighbours to another concatenated matrix made up of A and B, and figure out which point in this other concatenated matrix each query point is closest to. Then, read off what the label of that point was, and that'll give you what the label of the unknown point can possibly be.
In MATLAB, use the knnsearch function that's part of the Statistics Toolbox. However, I encourage you to take a look at my previous post explaining the k-nearest neighbours algorithm here: Finding K-nearest neighbors and its implementation
In any case, here's how you'd apply what I said above to your problem statement, assuming A, B, C and D are already defined:
labels = [ones(size(A,1),1); -ones(size(B,1),1)]; %// Create labels for A and B
%// Create source and query points
sourcePoints = [A; B];
queryPoints = [C; D];
%// Perform knnsearch
IDX = knnsearch(sourcePoints, queryPoints);
%// Extract out the groups per point
groups = labels(IDX);
groups will contain the labels associated with each of the points provided by queryPoints. knnsearch returns the row location of the source point in sourcePoints that best matched a query point. As such, each value of the output tells you which point in the source point matrix best matched that particular query point, which is ultimately the location we need in the labels array to figure out what the actual labels are. For example, if IDX(5) is 3, then the closest source point to the fifth query point is sourcePoints(3,:), and so groups(5) = labels(3).
Therefore, if you want to see what labels were assigned to the points in C and D, you can do:
labelsC = groups(1:size(C,1));
labelsD = groups(size(C,1)+1:end);
Therefore, labelsC and labelsD contain the labels assigned to each of the unknown points in the two matrices. Any value that is 1 means that the particular point resembled those from matrix A. Similarly, any value that is -1 means that the particular point resembled those from matrix B.
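Coming back to the distribution idea from earlier, here's a quick sketch of that percentage calculation using labelsC and labelsD from above (the pct* names are just for illustration):
%// Percentage of points in C and D assigned to each group
pctCA = 100 * sum(labelsC == 1) / numel(labelsC);  %// % of C labelled as A
pctCB = 100 * sum(labelsC == -1) / numel(labelsC); %// % of C labelled as B
pctDA = 100 * sum(labelsD == 1) / numel(labelsD);  %// % of D labelled as A
pctDB = 100 * sum(labelsD == -1) / numel(labelsD); %// % of D labelled as B
If pctCA turns out to be large, that supports the idea that C is distributed very much like A.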
If you want to plot all of this together, just combine what you did in the previous question with your new data from this question:
%// Code as before
[coeffA, scoreA] = pca(A);
[coeffB, scoreB] = pca(B);
numDimensions = 2;
scoreAred = scoreA(:,1:numDimensions);
scoreBred = scoreB(:,1:numDimensions);
%// New - Perform dimensionality reduction on C and D
[coeffC, scoreC] = pca(C);
[coeffD, scoreD] = pca(D);
scoreCred = scoreC(:,1:numDimensions);
scoreDred = scoreD(:,1:numDimensions);
%// Plot the data
plot(scoreAred(:,1), scoreAred(:,2), 'rx', scoreBred(:,1), scoreBred(:,2), 'bo');
hold on;
plot(scoreCred(labelsC == 1,1), scoreCred(labelsC == 1,2), 'gx', ...
     scoreCred(labelsC == -1,1), scoreCred(labelsC == -1,2), 'mo');
plot(scoreDred(labelsD == 1,1), scoreDred(labelsD == 1,2), 'kx', ...
     scoreDred(labelsD == -1,1), scoreDred(labelsD == -1,2), 'co');
The above is the case for two dimensions. We plot both A and B with their dimensionality reduced to 2. Similarly, we apply PCA to C and D, then plot everything together. The first plot call draws A and B normally. Next, we have to use hold on; so we can invoke plot multiple times and append results to the same figure. The two remaining plot calls cover four different combinations:
- C having labels from A
- C having labels from B
- D having labels from A
- D having labels from B
In each case I have used a different colour, but the same markers to denote which class each point belongs to: x for group A and o for group B.
I'll leave it to you to extend this to three dimensions.
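That said, if you'd like a starting point, here's a minimal sketch of the three-dimensional version. It just bumps numDimensions up to 3 and swaps plot for plot3, keeping the same colours and markers (this assumes at least 3 principal components are available):
%// Sketch of the three-dimensional case
numDimensions = 3;
scoreAred = scoreA(:,1:numDimensions);
scoreBred = scoreB(:,1:numDimensions);
scoreCred = scoreC(:,1:numDimensions);
scoreDred = scoreD(:,1:numDimensions);
plot3(scoreAred(:,1), scoreAred(:,2), scoreAred(:,3), 'rx', ...
      scoreBred(:,1), scoreBred(:,2), scoreBred(:,3), 'bo');
hold on;
plot3(scoreCred(labelsC == 1,1), scoreCred(labelsC == 1,2), scoreCred(labelsC == 1,3), 'gx', ...
      scoreCred(labelsC == -1,1), scoreCred(labelsC == -1,2), scoreCred(labelsC == -1,3), 'mo');
plot3(scoreDred(labelsD == 1,1), scoreDred(labelsD == 1,2), scoreDred(labelsD == 1,3), 'kx', ...
      scoreDred(labelsD == -1,1), scoreDred(labelsD == -1,2), scoreDred(labelsD == -1,3), 'co');
grid on; %// 3D scatter plots are easier to read with the grid on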