
I have the following scenario for processing and then classifying natural-language text:

Initially, I have an algorithm Alg1 which can classify some data/text according to some matrix scores. I can build a feature matrix whose columns are scored roughly as follows:

  • POS
  • Modal verbs
  • Sentence length
  • Special words (if a sentence has a special word -> score = 1)
  • Special verbs (if a sentence has one or more of the special verbs)
  • Conditions (while, if, then)
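A minimal sketch of what such rule-based scoring might look like in Python (the word lists below are placeholders I've invented, and POS scoring is omitted since it would need a tagger):

    # Hypothetical rule-based feature scoring for one sentence.
    # The word lists are placeholders, not the real application lists.
    MODAL_VERBS = {"can", "could", "may", "might", "must",
                   "shall", "should", "will", "would"}
    SPECIAL_WORDS = {"error", "failure"}       # placeholder
    SPECIAL_VERBS = {"execute", "terminate"}   # placeholder
    CONDITION_WORDS = {"while", "if", "then"}

    def score_sentence(sentence):
        tokens = sentence.lower().split()
        return {
            "modal_verbs": sum(t in MODAL_VERBS for t in tokens),
            "sentence_length": len(tokens),
            "special_word": int(any(t in SPECIAL_WORDS for t in tokens)),
            "special_verb": int(any(t in SPECIAL_VERBS for t in tokens)),
            "condition": int(any(t in CONDITION_WORDS for t in tokens)),
        }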


Then, according to these matrix scores, I can initially classify sentences into different classes {class1, class2, class3} using only if-then statements. My question is: how can I merge (normalize) this approach with a text classifier such as an SVM (or any other) in order to get better precision/recall? What is the idea behind implementing such a mixed approach?
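For context, the current if-then step looks roughly like this (the thresholds and rules are invented purely for illustration):

    # Illustrative if-then rules over the scores above; the real
    # thresholds and rule order depend on the application.
    def classify_by_rules(scores):
        if scores["condition"] and scores["special_verb"]:
            return "class1"
        if scores["modal_verbs"] > 0 or scores["special_word"]:
            return "class2"
        return "class3"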

  • What is it about the problem you're addressing that makes a more traditional Bag of Words style text classification inappropriate? – Bob Dillon Sep 10 '15 at 14:35
  • @Bob Dillon Excuse me; can you please explain what you mean by a more traditional Bag of Words? Is this a question or an answer? Sorry, I did not understand you. – Fawzi Belal Sep 10 '15 at 15:31
  • @Bob Dillon Why do you think it is not appropriate? – Fawzi Belal Sep 10 '15 at 17:10
  • Bag of Words is a text model that ignores the order of words (N-Grams more specifically) in a piece of text and instead simply looks at the list of words in the text, and often the frequency of each word. In Machine Learning and Text Classification you can use the BoW representation to do accurate text classification. With BoW you translate a random piece of text into a vector that can be fed to an SVM model for classification. This is, of course, a supervised training case so you need a corpus of training/testing text to create the model for each class. – Bob Dillon Sep 10 '15 at 18:51
  • If this is in the ballpark of what you want to do, I can write it up in a bit more detail as an answer. I just want to see whether I'm in the ballpark of what you're trying to accomplish. – Bob Dillon Sep 10 '15 at 18:52
  • Yes @Bob Dillon, please give me the details. – Fawzi Belal Sep 10 '15 at 21:02

1 Answer


Bag-of-Words (BoW) is a text model in which word order, grammar, syntax, etc. are ignored and only the presence of words is considered. It's as if you took the words from a piece of text, dropped them into a bag, and shook it up, Scrabble-style. This is also called the Naive Bayes assumption, because it looks at word probabilities irrespective of order; I find that name confusing given the Naive Bayes machine learning model, but these are different things that share a name. The BoW model is used in text classification and information retrieval applications, to name a couple.
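As a tiny illustration of the representation (plain Python, counts only, order discarded):

    from collections import Counter

    # "the" is counted twice; where each word appeared is forgotten.
    bow = Counter("the cat sat on the mat".split())
    print(bow)   # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})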

In the most general case, we start with a training corpus of positive documents (of the class you're looking for) and negative documents (not of the class). The corpus is examined and every unique word (symbol) in it is identified. This symbol list is called the feature set. Using the feature set, a vector is generated to represent each document in the training corpus. The vector consists of either binary values (feature present/absent in the document) or numbers (frequency of the feature in the document). These vectors are a BoW representation of the corpus and can be used to train a model such as an SVM. Once a model is trained, vectors can be generated from documents "in the wild" and the model can be used to classify each document as belonging to the positive or negative class with a particular likelihood.
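Here is a minimal sketch of that flow using scikit-learn, one tool among many (the four-document corpus is invented, and a real one would be far larger):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    # Invented toy corpus: label 1 = positive class, 0 = negative class.
    train_docs = [
        "the match ended in a draw",
        "the striker scored twice",
        "parliament passed the bill",
        "the senate debated the budget",
    ]
    train_labels = [1, 1, 0, 0]

    # Build the feature set from the corpus and vectorize each document.
    # binary=True would give presence/absence instead of frequencies.
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_docs)

    model = LinearSVC()
    model.fit(X_train, train_labels)

    # Vectorize and classify a document "in the wild".
    X_new = vectorizer.transform(["who scored the winning goal"])
    print(model.predict(X_new))   # expected [1] on this toy data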

With any substantial corpus it's typical to have tens of thousands, hundreds of thousands, or even millions of unique symbols. To get high classification performance, a process known as Dimensionality Reduction or Feature Reduction is performed. Feature reduction seeks to eliminate the symbols that are least effective at classifying, leaving only the most relevant features to consider. As an example, the word "the" appears in almost all English text and so is of no value in separating documents into classes, while the word "football" would be of high value in sorting documents into those related to sports and those not. Dimensionality Reduction is a deep subject all by itself; here's another Stack question where it's addressed in a bit of detail.
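One simple form of this is univariate feature selection, sketched here in scikit-learn with the same invented toy corpus:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    docs = [
        "the match ended in a draw",
        "the striker scored twice",
        "parliament passed the bill",
        "the senate debated the budget",
    ]
    labels = [1, 1, 0, 0]

    X = CountVectorizer().fit_transform(docs)

    # Keep only the k features with the highest chi-squared score
    # against the labels; words like "the" that appear everywhere
    # carry no class information and are dropped.
    X_reduced = SelectKBest(chi2, k=4).fit_transform(X, labels)
    print(X.shape, "->", X_reduced.shape)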

There are other variations, such as the use of N-grams (N consecutive words treated as a single symbol); a small sketch of this follows below. Google "bag of words text classification" and you will find many academic papers, blog posts, books, etc. that describe this technique in much greater detail and explore the many aspects of optimizing performance for a variety of applications. There are also many tools for almost any language that simplify the implementation of a BoW text classifier; Google your language of choice and "bag of words". I hope this gets you started.
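In the scikit-learn sketch above, for example, the N-gram variation is a one-parameter change:

    from sklearn.feature_extraction.text import CountVectorizer

    # ngram_range=(1, 2) adds every pair of consecutive words
    # (bigrams) to the feature set alongside the single words.
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    vectorizer.fit(["the striker scored twice"])
    print(vectorizer.get_feature_names_out())
    # ['scored' 'scored twice' 'striker' 'striker scored'
    #  'the' 'the striker' 'twice']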

Bob Dillon