
I have a million documents belonging to different classes (100 classes). I want to find outlier documents in each class (documents that don't actually belong to that class but were wrongly classified) and filter them out. I can compute document similarity using cosine similarity by comparing the tokens of each document, but I am not able to apply this to filter out the wrongly classified documents for a given class. Example: consider 3 classes for simplicity, with the documents under them.

ClassA  ClassB  ClassC ... 
doc1    doc2    doc3 
doc4    doc5    doc6 
doc7    doc8    doc9 

How can I figure out, effectively and efficiently, that doc4 (and other similar docs) is wrongly classified in ClassA, so that my training data does not contain outliers?

Gaurav Chawla

4 Answers


This is a hard problem in unsupervised learning, and a common approach to it is topic modelling. You can start by running the LDA (Latent Dirichlet Allocation) algorithm; I suggest using the gensim package for that. Don't run it on all your data: take 20-50 thousand documents at the beginning. Once you have an initial model, select from your millions of documents only those that were classified as belonging to some class with a probability above a certain threshold. Train LDA again on those. This should give you classes that are better separated. Then reclassify your data.

The LDA algorithm classifies documents in a "soft" way, so each document has a certain probability of belonging to each of your 100 classes. Usually, the documents that have a high probability of belonging to many classes at the same time are the badly classified ones.
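
Here is a minimal sketch of that workflow with gensim. The loader `load_tokenized_sample`, the topic count, and the 0.9 probability threshold are all assumptions for illustration, not part of the original suggestion:

    from gensim import corpora
    from gensim.models import LdaModel

    # Hypothetical loader: returns a list of token lists for the initial subset.
    texts = load_tokenized_sample(n=50000)

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(tokens) for tokens in texts]

    # Train the initial LDA model on the 20-50k document sample.
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=100, passes=5)

    # Keep only documents assigned to some topic with high probability,
    # then retrain LDA on this better-separated subset.
    confident = []
    for tokens in texts:
        bow = dictionary.doc2bow(tokens)
        topic, prob = max(lda.get_document_topics(bow), key=lambda tp: tp[1])
        if prob >= 0.9:  # assumed threshold
            confident.append(tokens)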

You can do all that without involving human labelers.

igrinis

Since you have labels for the 100 classes, this is in principle a fairly standard outlier detection problem, and you need to find documents that do not resemble most of the documents that carry the same label.

As you suggest, you can use cosine similarity (on word counts, I assume) to score the similarity of pairs of documents. There are many practical issues involved with cosine similarity, such as the selection of important words, stemming, stop words and so on, and you may also wish to account for word similarity via soft cosine similarity.

It would be impractical to calculate all pairwise cosine similarities for such a large corpus, so you will need to summarise each class somehow. A simple method would be to average the word counts over the documents in each class and measure the similarity between this model document and each member of the class; to score a document you then only need to calculate a single cosine similarity, as in the sketch below. You should reject some chosen percentile of documents as potentially misclassified, with a threshold comparable to the percentage of misclassified documents you expect. Obviously a higher threshold will eliminate more errors, but also more correctly classified documents.
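
A minimal sketch of the centroid approach, assuming scikit-learn is acceptable; the variable `class_a_documents` and the 5th-percentile cutoff are assumptions:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = class_a_documents  # hypothetical: raw texts of one class

    # Word counts per document, and their average as the "model document".
    X = CountVectorizer(stop_words="english").fit_transform(docs)
    centroid = np.asarray(X.mean(axis=0))

    # One cosine similarity per document, against the class centroid.
    scores = cosine_similarity(X, centroid).ravel()

    # Flag the least similar percentile as potentially misclassified.
    cutoff = np.percentile(scores, 5)  # assumed 5th percentile
    suspects = np.where(scores < cutoff)[0]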

A better implementation might be to apply a fast clustering algorithm separately to each of the 100 kinds of documents. The average word count within each cluster would give you a handful of model documents for each label, and you should use the highest similarity as the score for each document.
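
A sketch of the clustering variant, again with assumed names and an assumed cluster count; each document is scored by its similarity to the nearest cluster centre:

    from sklearn.cluster import MiniBatchKMeans
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = class_a_documents  # hypothetical: raw texts of one class
    X = CountVectorizer(stop_words="english").fit_transform(docs)

    # A handful of "model documents" for this class (assumed 5 clusters).
    km = MiniBatchKMeans(n_clusters=5, random_state=0).fit(X)

    # Use the highest similarity to any cluster centre as the score.
    scores = cosine_similarity(X, km.cluster_centers_).max(axis=1)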

rmac

For this problem you can try something like the Mahalanobis distance.

Essentially, the Mahalanobis distance measures the distance of a point P from a distribution D. For a point x and a distribution with mean vector mu and covariance matrix S, it is

    d(x) = sqrt( (x - mu)^T S^-1 (x - mu) )

So in your case:

  1. Get the distribution D of a class

    First we need a vector representation of each document in the class. This can be done in various ways; most simply, we can represent each doc by its tf-idf vector. Then we can calculate the mean and the covariance S for the class from these vectors.

    A more sophisticated approach would be to obtain a vector representation of each document by passing it through a document representation model such as Doc2Vec.

  2. Calculate the Mahalanobis distance for each doc in the class

    Take each doc's vector representation and calculate its distance from D using the formula above. You will have to set some threshold, which can be chosen by checking a few examples that you are quite certain are outliers; see the sketch after this list.

    Even this can be automated to some extent: you can calculate the distance for each doc, fit a distribution (such as a Weibull distribution) to those distances, and flag the documents in its tail.
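
A minimal sketch of the tf-idf variant, assuming scikit-learn and NumPy; the SVD dimensionality, the `class_a_documents` variable, and the 5% cutoff are assumptions (a raw tf-idf matrix is too high-dimensional for a stable covariance estimate, hence the reduction step):

    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = class_a_documents  # hypothetical: raw texts of one class

    # tf-idf vectors, reduced so the covariance matrix is well conditioned.
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    X = TruncatedSVD(n_components=50).fit_transform(X)  # assumed 50 dims

    # Mean and (pseudo-)inverse covariance of the class distribution D.
    mu = X.mean(axis=0)
    S_inv = np.linalg.pinv(np.cov(X, rowvar=False))

    # Mahalanobis distance of each document from D.
    diff = X - mu
    dists = np.sqrt(np.einsum("ij,jk,ik->i", diff, S_inv, diff))

    # Flag the most distant documents as candidate outliers (assumed 5%).
    outliers = np.argsort(dists)[-max(1, int(0.05 * len(docs))):]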

I must add that the effectiveness of this method depends on the proportion of outliers in the class: the more outliers there are, the more they affect the distribution D.

moulin

First, the usual term for what you are describing is 'classification'. Clustering is a form of unsupervised modelling, where no class labels are known. So your task is to classify the unlabelled documents into one of a set of known classes, or into an "unknown" class for outliers.

Second, you have many options for "outlier detection"! One is as you suggest: classify the documents and define as an outlier anything that is distant from the nearest class (e.g. measured in standard deviations). Or, if you use a probabilistic classifier such as naive Bayes, you could define outliers as documents whose maximum class probability is very low.
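
For instance, a minimal sketch of the naive Bayes option with scikit-learn; `train_docs`, `train_labels`, and the 0.5 cutoff are assumptions:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical labelled corpus: raw texts and their class labels.
    X = CountVectorizer(stop_words="english").fit_transform(train_docs)
    clf = MultinomialNB().fit(X, train_labels)

    # Flag documents whose highest class probability is low.
    max_prob = clf.predict_proba(X).max(axis=1)
    suspects = max_prob < 0.5  # assumed threshold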

The best method, and the choice of thresholds, depends very much on the details of your data, so you will need to try out several approaches and see what works best.