I am building a document classifier to categorize documents.
So the first step is to represent each document as a "feature vector" for training.
After some research, I found that I can use either a bag-of-words approach or an n-gram approach to represent a document as a vector.
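To make sure I understand the two representations, here is a minimal sketch of how I picture them, in plain Python. The function names `bag_of_words` and `char_ngrams` are my own, the whitespace tokenization is a simplifying assumption, and I am using character-level n-grams here (word-level n-grams are the other common variant).

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercased, whitespace-separated tokens (assumed tokenization)."""
    return Counter(text.lower().split())

def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the lowercased text."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

doc = "the cat sat on the mat"
print(bag_of_words(doc))      # one count per distinct word
print(char_ngrams(doc, n=3))  # one count per distinct 3-character window
```

In both cases each distinct key (word or n-gram) would become one dimension of the feature vector, with the count as its value.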
The text in each document (scanned PDFs and images) is extracted using OCR, so some words contain errors. I also have no prior knowledge of the language used in these documents, so I can't use stemming.
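My reasoning about the OCR errors can be illustrated with a small sketch. The misread word "invo1ce" is a made-up example, and `char_ngrams` and `cosine` are hypothetical helpers I wrote for illustration: a word with one wrong character shares nothing with the correct word when compared as whole tokens, but still shares some character trigrams.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the lowercased text."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

clean, noisy = "invoice", "invo1ce"  # hypothetical OCR misread of one character

# Whole-word comparison: the two tokens are simply different.
word_sim = cosine(Counter([clean]), Counter([noisy]))

# Character trigram comparison: "inv" and "nvo" survive the OCR error.
ngram_sim = cosine(char_ngrams(clean), char_ngrams(noisy))

print(word_sim, ngram_sim)
```

The whole-word similarity is zero while the trigram similarity is not, which is why I expect character n-grams to tolerate OCR noise better.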
So, as far as I understand, I have to use the n-gram approach. Or are there other approaches to representing a document?
I would also appreciate it if someone could point me to an n-gram guide, so that I can get a clearer picture of how it works.
Thanks in advance.