
I am building a classifier to categorize documents.

The first step is to represent each document as a "feature vector" for training.

After some research, I found that I can use either the Bag of Words approach or the N-gram approach to represent a document as a vector.

The text in each document (scanned PDFs and images) is retrieved using OCR, so some words contain errors. I also have no prior knowledge of the language used in these documents, so I can't use stemming.

So, as far as I understand, I have to use the N-gram approach. Or are there other approaches to represent a document?

I would also appreciate it if someone could link me to an N-gram guide so I can get a clearer picture of how it works.

Thanks in Advance

TeFa

1 Answer

  1. Use language detection to get the document's language (my favorite tool is LanguageIdentifier from the Tika project, but many others are available); see the sketch after this list.
  2. Use spell correction (see this question for some details).
  3. Stem words (if you work in a Java environment, Lucene is your choice).
  4. Collect all N-grams (see below).
  5. Make instances for classification by extracting N-grams from particular documents.
  6. Build a classifier.
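
As a rough illustration of step 1, here is a minimal sketch assuming Tika 1.x's LanguageIdentifier is on the classpath (the class and method names may differ in other Tika versions, and the sample text is just a placeholder):

```java
import org.apache.tika.language.LanguageIdentifier;

public class DetectLanguage {
    public static void main(String[] args) {
        // OCR output for one document (placeholder text)
        String ocrText = "Hello, my name is Frank";

        LanguageIdentifier identifier = new LanguageIdentifier(ocrText);
        System.out.println(identifier.getLanguage());          // ISO 639 code, e.g. "en"
        System.out.println(identifier.isReasonablyCertain());  // may be false for short or noisy OCR text
    }
}
```

Short or very noisy OCR text can make the guess unreliable, so checking the certainty flag before trusting the result is a reasonable precaution.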

N-gram models

N-grams are just sequences of N items. In classification by topic you normally use N-grams of words or their roots (though there are also models based on N-grams of characters). The most popular N-grams are unigrams (a single word), bigrams (2 consecutive words) and trigrams (3 consecutive words). So, from the sentence

Hello, my name is Frank

you should get the following unigrams:

[hello, my, name, is, frank] (or [hello, I, name, be, frank], if you use roots)

and the following bigrams:

[hello_my, my_name, name_is, is_frank]

and so on.
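
To make the extraction concrete, here is a minimal plain-Java sketch with no library dependencies (the class and method names are just placeholders) that produces the unigrams and bigrams shown above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class NGramExtractor {
    // Split text into lowercase word tokens, dropping punctuation and digits-free separators.
    static List<String> tokenize(String text) {
        String[] parts = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        List<String> tokens = new ArrayList<>();
        for (String p : parts) {
            if (!p.isEmpty()) tokens.add(p);
        }
        return tokens;
    }

    // Join every run of n consecutive tokens with '_' to form an n-gram.
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join("_", tokens.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("Hello, my name is Frank");
        System.out.println(ngrams(tokens, 1)); // [hello, my, name, is, frank]
        System.out.println(ngrams(tokens, 2)); // [hello_my, my_name, name_is, is_frank]
    }
}
```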

In the end your feature vector should have as many positions (dimensions) as there are distinct words in all your texts, plus 1 for unknown words. Every position in an instance vector should somehow reflect the number of corresponding words in the instance text. This may be the number of occurrences, a binary feature (1 if the word occurs, 0 otherwise), a normalized feature, or tf-idf (very popular in classification by topic).
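
As a sketch of those weighting options, here is plain Java computing raw term counts and a simple tf-idf weight (the +1 smoothing in the idf denominator is one common convention I'm assuming here, not the only one):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FeatureWeights {
    // Raw occurrence counts of each n-gram in one document.
    static Map<String, Integer> termCounts(List<String> ngrams) {
        Map<String, Integer> counts = new HashMap<>();
        for (String g : ngrams) counts.merge(g, 1, Integer::sum);
        return counts;
    }

    // tf-idf for one term in one document:
    // tf  = count of the term in this document,
    // idf = log(totalDocs / (1 + docsWithTerm)), +1 to avoid division by zero.
    static double tfIdf(int countInDoc, int docsWithTerm, int totalDocs) {
        double tf = countInDoc;
        double idf = Math.log((double) totalDocs / (1 + docsWithTerm));
        return tf * idf;
    }
}
```

A binary feature is just `countInDoc > 0 ? 1 : 0`, and a normalized feature divides the count by the document length.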

The classification process itself is the same as for any other domain.

ffriend
  • @ffriend, sorry but I am confused about something here ... what's the difference between the Snowball Analyzer and a Lucene Analyzer? I downloaded the Lucene core libraries, but they don't include any Snowball Analyzer! – TeFa Aug 22 '12 at 13:38
  • @TeFa, Lucene is the name of the whole library, so any analyzer in this library is a "Lucene analyzer". SnowballAnalyzer is one popular analyzer that can be configured for different languages by passing a language string to its constructor. You can find this analyzer in JARs with names like "lucene-snowball-3.1.1.jar". However, at the moment SnowballAnalyzer is [considered deprecated](http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/snowball/SnowballAnalyzer.html) and using the language-specific analyzers from module/analysis is suggested instead. – ffriend Aug 22 '12 at 20:28
  • Also see [this](http://stackoverflow.com/questions/5483903/comparison-of-lucene-analyzers/5484488#5484488) question to get the idea of how analyzers work and what they consist of. – ffriend Aug 22 '12 at 20:29
  • Yeah, I just noticed that the SnowballAnalyzer is deprecated and that I should use the analyzers in contrib/analyzers instead. Language detection and word stemming are working for me now :) ... I also found a really good language detection library, http://code.google.com/p/language-detection/, because the one you suggested gave me incorrect results; not sure if I was doing something wrong. But thanks a lot, I wouldn't have been able to get this far without your help ... – TeFa Aug 22 '12 at 22:57