I need to automatically generate the categories of a publication from its abstract, with support for synonyms. I have classification data for 800-900 articles that I can use for training. This data was produced by pharma experts who read the unstructured publications.
The existing categories for a publication look like this:
- Drug : Some drug, Some other drug.
- Diseases : Some Disease.
- Authors : Some authors, and so on.
These categories are currently generated by a human expert. I explored the natural library for Node.js and LingPipe for Java. Both have classifiers, but I cannot figure out the most efficient way to train them so that I get 90% accuracy.
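To make the 90% target concrete, here is a minimal sketch of how I assume the output could be scored against the expert labels on held-out articles. Since each publication has several categories, it reports micro-averaged precision and recall rather than a single accuracy number. The `evaluate` function and the sample data are hypothetical and only for illustration.

```javascript
// Compare predicted labels with the expert labels for each held-out article
// and compute micro-averaged precision and recall over all labels.
function evaluate(examples) {
  let truePositives = 0;
  let predictedTotal = 0;
  let expectedTotal = 0;
  for (const { expected, predicted } of examples) {
    const expectedSet = new Set(expected);
    const predictedSet = new Set(predicted);
    for (const label of predictedSet) {
      if (expectedSet.has(label)) truePositives += 1;
    }
    predictedTotal += predictedSet.size;
    expectedTotal += expectedSet.size;
  }
  return {
    precision: predictedTotal === 0 ? 0 : truePositives / predictedTotal,
    recall: expectedTotal === 0 ? 0 : truePositives / expectedTotal,
  };
}

// Made-up held-out examples, for illustration only.
const heldOut = [
  { expected: ['drug:Paracetamol', 'year:2012'], predicted: ['drug:Paracetamol', 'year:2012'] },
  { expected: ['drug:Paracetamol', 'author:techgyani'], predicted: ['drug:Paracetamol'] },
];
console.log(evaluate(heldOut));
```

Holding back part of the 800-900 labelled articles for this kind of check would let the approaches be compared on the same data.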
These are the approaches I have in mind:
I can pass the entire abstract of each publication, one at a time, and tell the classifier its categories, like below:
    var natural = require('natural');
    var classifier = new natural.BayesClassifier();

    // The same full abstract is added once per category it belongs to.
    classifier.addDocument('This article is for paracetamol written by Techgyani. Article was written in 2012', 'year:2012');
    classifier.addDocument('This article is for paracetamol written by Techgyani. Article was written in 2012', 'author:techgyani');
    classifier.train();
I can pass it sentences one by one and tell it the category of each, which will be a manual and time-consuming process. Then, when I pass it an entire abstract, it will generate a set of categories for me, like below:
    var natural = require('natural');
    var classifier = new natural.BayesClassifier();

    // Each sentence is labelled separately with the category it supports.
    classifier.addDocument('This article is for paracetamol written by Techgyani', 'drug:Paracetamol');
    classifier.addDocument('This article is for paracetamol written by Techgyani', 'author:techgyani');
    classifier.addDocument('Article was written in 2012', 'year:2012');
    classifier.train();
I can also extract tokens from the publication, look them up in my database, and work out the categories on my own without using any NLP/ML libraries.
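A minimal sketch of what I mean by this lookup approach, assuming the database can be exported as a dictionary that maps each canonical term to its synonyms. The `dictionary` object, its entries, and the helper names (`tagAbstract`, `extractYears`) are hypothetical and only illustrate the idea.

```javascript
// Hypothetical synonym dictionary: in practice this would be loaded from the
// database of terms the experts already use. All entries are illustrative.
const dictionary = {
  drug: {
    Paracetamol: ['paracetamol', 'acetaminophen'],
  },
  author: {
    Techgyani: ['techgyani'],
  },
};

// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Return labels such as "drug:Paracetamol" for every dictionary entry that has
// at least one synonym appearing as a whole word or phrase in the abstract.
function tagAbstract(abstract, dict) {
  const labels = [];
  for (const [category, entries] of Object.entries(dict)) {
    for (const [canonical, synonyms] of Object.entries(entries)) {
      const found = synonyms.some((synonym) =>
        new RegExp('\\b' + escapeRegExp(synonym) + '\\b', 'i').test(abstract)
      );
      if (found) labels.push(category + ':' + canonical);
    }
  }
  return labels;
}

// Years are a pattern rather than a vocabulary, so a regular expression is used.
function extractYears(abstract) {
  const matches = abstract.match(/\b(?:19|20)\d{2}\b/g) || [];
  return [...new Set(matches)].map((year) => 'year:' + year);
}

const abstract =
  'This article is about acetaminophen, written by Techgyani. Article was written in 2012';
console.log([...tagAbstract(abstract, dictionary), ...extractYears(abstract)]);
```

Here the synonym "acetaminophen" resolves to the canonical label `drug:Paracetamol`, which is the synonym support I need. The obvious limit is that a term missing from the dictionary is never found.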
In your experience, which is the most efficient way to solve this problem? I am open to a solution in any language, but I prefer JavaScript because the existing stack is in JavaScript.