
I need to auto-generate categories for a publication from its abstract, with support for synonyms. I have classification data for 800-900 articles that I can use for training. This data was produced by pharma experts reading the unstructured publications.

The existing classification categories for these publications look like this:

  1. Drug: Some drug, Some other drug.
  2. Diseases: Some disease.
  3. Authors: Some authors, and so on.

These categories are currently generated by a human expert. I have explored the natural library in Node.js and LingPipe in Java. Both have classifiers, but I am not able to figure out the most efficient way to train them so that I get 90% accuracy.

These are the approaches I have in mind:

  1. I can pass the entire abstract of each publication and tell the classifier its categories, like below:

    var natural = require('natural');
    var classifier = new natural.BayesClassifier();
    classifier.addDocument('This article is for parcetamol written by Techgyani. Article was written in 2012', 'year:2012');
    classifier.addDocument('This article is for parcetamol written by Techgyani. Article was written in 2012', 'author:techgyani');
    classifier.train();
    
  2. I can pass it sentences one by one and tell it each sentence's category, which will be a manual and time-consuming process. Then, when I pass it an entire abstract, it will generate the set of categories for me, like below:

    var natural = require('natural');
    var classifier = new natural.BayesClassifier();
    classifier.addDocument('This article is for parcetamol written by Techgyani', 'drug:Paracetamol');
    classifier.addDocument('This article is for parcetamol written by Techgyani', 'author:techgyani');
    classifier.addDocument('Article was written in 2012', 'year:2012');
    classifier.train();
    
  3. I can also extract tokens from the publication, search my database, and work out the categories myself, without any NLP/ML libraries.
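
The third approach can be sketched in plain JavaScript as a dictionary lookup; the term lists below are made-up stand-ins for a real database query:

```javascript
// Approach 3: match tokens from the abstract against known terms.
// These term lists are illustrative; in practice they would come from the database.
const dictionary = {
  drug: ['paracetamol', 'ibuprofen'],
  disease: ['malaria', 'diabetes'],
  author: ['techgyani']
};

function extractCategories(abstract) {
  // Lowercase and split into alphanumeric tokens.
  const tokens = abstract.toLowerCase().match(/[a-z0-9]+/g) || [];
  const categories = [];
  for (const [category, terms] of Object.entries(dictionary)) {
    for (const term of terms) {
      if (tokens.includes(term)) categories.push(`${category}:${term}`);
    }
  }
  return categories;
}

console.log(extractCategories('This article is for paracetamol written by Techgyani'));
// → [ 'drug:paracetamol', 'author:techgyani' ]
```

This handles synonyms directly (just add them to the term lists), but it can only find terms that are already in the database.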

In your experience, which is the most efficient way to solve this problem? I am open to solutions in any language, but I prefer JavaScript because the existing stack is in JavaScript.

techgyani

3 Answers


I'd recommend using either the most frequent words or raw word frequencies as features in a naive Bayes classifier.

There's no need to tag sentences individually. I'd expect reasonable accuracy at the document level, although that will depend on the nature of the documents you train on and classify.

There's a good discussion of a Python implementation here:

Implementing Bag-of-Words Naive-Bayes classifier in NLTK
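
To stay in the question's stack, the same idea can be sketched in plain JavaScript: a bag-of-words naive Bayes with add-one (Laplace) smoothing. The training sentences below are invented for illustration; natural's BayesClassifier implements much the same technique internally.

```javascript
// Bag-of-words naive Bayes with add-one (Laplace) smoothing.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

class NaiveBayes {
  constructor() {
    this.docCounts = {};   // label -> number of training documents
    this.wordCounts = {};  // label -> { word -> count }
    this.totalWords = {};  // label -> total tokens seen for that label
    this.vocab = new Set();
    this.totalDocs = 0;
  }

  addDocument(text, label) {
    this.docCounts[label] = (this.docCounts[label] || 0) + 1;
    this.wordCounts[label] = this.wordCounts[label] || {};
    this.totalWords[label] = this.totalWords[label] || 0;
    this.totalDocs += 1;
    for (const w of tokenize(text)) {
      this.wordCounts[label][w] = (this.wordCounts[label][w] || 0) + 1;
      this.totalWords[label] += 1;
      this.vocab.add(w);
    }
  }

  classify(text) {
    let best = null;
    let bestScore = -Infinity;
    for (const label of Object.keys(this.docCounts)) {
      // log prior + sum of smoothed log likelihoods
      let score = Math.log(this.docCounts[label] / this.totalDocs);
      for (const w of tokenize(text)) {
        const count = this.wordCounts[label][w] || 0;
        score += Math.log((count + 1) / (this.totalWords[label] + this.vocab.size));
      }
      if (score > bestScore) {
        bestScore = score;
        best = label;
      }
    }
    return best;
  }
}

const nb = new NaiveBayes();
nb.addDocument('paracetamol reduces fever and pain', 'drug');
nb.addDocument('malaria is a mosquito borne disease', 'disease');
console.log(nb.classify('paracetamol for pain')); // → 'drug'
```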

John R

In my opinion, your second solution will work like a charm. You need to train your classifier in order for it to do this work.

You add each sentence with its label via classifier.addDocument(text, label) and then call classifier.train() (which takes no arguments in natural). I know the labelling is manual work, but it should not take long to train your classifier.

Once it is trained, you can pass in one of your sentences and check the output for yourself.

Gulam Mohammed

You should explore off-the-shelf Named Entity Recognition (NER) models before investing in training your own. spaCy is written in Python but has JavaScript bindings. The classifiers in natural use naive Bayes and logistic regression, and will not perform as well as a neural-network library like spaCy. I suspect that natural will not generalize to new cases where it has not already seen the drug, disease, or author name in the training set.

Adnan S