I am working on a text classification problem: I am trying to classify a collection of words into a category. Yes, there are plenty of libraries available for classification, so please don't answer if you are only going to suggest using one of them.
Let me explain what I want to implement, using an example.
List of Words:
- java
- programming
- language
- c-sharp
List of Categories:
- java
- c-sharp
Here we train on the set as follows:
- java maps to category 1 (java)
- programming maps to category 1 (java)
- programming maps to category 2 (c-sharp)
- language maps to category 1 (java)
- language maps to category 2 (c-sharp)
- c-sharp maps to category 2 (c-sharp)
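To make the training step concrete, here is a minimal sketch of how the mappings above could be stored as a word-to-categories index. The names `TRAINING_PAIRS`, `train` and `word_to_categories` are made up for illustration, and lowercasing the words is my own assumption.

```python
from collections import defaultdict

# Training pairs copied from the list above: (word, category).
TRAINING_PAIRS = [
    ("java", "java"),
    ("programming", "java"),
    ("programming", "c-sharp"),
    ("language", "java"),
    ("language", "c-sharp"),
    ("c-sharp", "c-sharp"),
]


def train(pairs):
    """Build a word -> set-of-categories index from (word, category) pairs."""
    index = defaultdict(set)
    for word, category in pairs:
        # Assumption: matching is case-insensitive, so store words lowercased.
        index[word.lower()].add(category)
    return dict(index)


word_to_categories = train(TRAINING_PAIRS)
```

With this index, a word that maps to more than one category (such as "programming") can be recognised simply by the size of its set.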
Now we have a phrase, "The best java programming book". From this phrase, the following words match our list of words:
- java
- programming
"programming" is mapped to two categories, "java" and "c-sharp", so it is a common word.
"java" is mapped only to the category "java".
So the matching category for the phrase is "java".
This is what came to mind. Is this solution sound, and can it be implemented? What are your suggestions? Am I missing anything, and are there any flaws?
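The matching rule I have in mind could be sketched roughly like this. Everything here is an assumption for illustration: the `classify` function, the `WORD_TO_CATEGORIES` index, the simple regex tokenizer, the fallback to common words when no exclusive word matches, and returning `None` on a tie or when nothing matches.

```python
import re
from collections import Counter

# Hypothetical index built from the training example: word -> categories.
WORD_TO_CATEGORIES = {
    "java": {"java"},
    "programming": {"java", "c-sharp"},
    "language": {"java", "c-sharp"},
    "c-sharp": {"c-sharp"},
}


def classify(phrase, index):
    """Return the best category for a phrase, or None if undecided."""
    # Assumption: a naive tokenizer that keeps '-', '#' and '+' inside words,
    # so that "c-sharp" stays a single token.
    tokens = re.findall(r"[a-z0-9#+-]+", phrase.lower())
    matched = [t for t in tokens if t in index]

    # Words mapped to exactly one category vote first; common words are ignored.
    votes = Counter()
    for word in matched:
        if len(index[word]) == 1:
            votes.update(index[word])

    # Assumed fallback: if only common words matched, let them vote instead.
    if not votes:
        for word in matched:
            votes.update(index[word])

    if not votes:
        return None  # no known word in the phrase

    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the top two categories
    return ranked[0][0]


result = classify("The best java programming book", WORD_TO_CATEGORIES)
```

For the example phrase, only "java" is an exclusive word, so it alone decides the category. A phrase containing only common words, such as "a programming language", ends in a tie under this sketch, which is one of the cases I am unsure how to handle.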