Text Classification using Spark ML

Question

I have a free-text description on which I need to perform classification. For example, the description can be that of an incident. Based on the description of the incident, I need to predict the risk associated with the event. For example, "A murder in town" is a candidate for "high" risk.

I tried logistic regression but realized that it currently supports only binary classification. For multiclass classification (there are only three possible values) based on a free-text description, what would be the most suitable algorithm (Linear Regression or Naive Bayes)?

score 2 · Answer 1 · edited May 23 '17 at 12:17

Since you are using Spark, I assume you have big data. I am no expert, but after reading your answer I would like to make some points.

Create the Training (80%) and Testing Data Sets (20%)

I would partition my data into Training (60-70%), Testing (15-20%) and Evaluation (15-20%) sets.

The idea is that you fit your classification algorithm on the Training set, but what we really want from classification tasks is to have them classify unseen data. So fine-tune your algorithm against the Testing set, and when you are done, use the Evaluation set to get a real understanding of how things work!
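To make the three-way split concrete, here is a minimal sketch in plain Java, outside Spark. The class `ThreeWaySplit`, its `split` method and the 70/15/15 fractions are my own illustrative assumptions, not anything from a library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper, not part of Spark: splits rows into three disjoint parts. */
final class ThreeWaySplit {

    /** Returns [training, testing, evaluation]; the evaluation part gets whatever is left over. */
    static <T> List<List<T>> split(List<T> rows, double trainFrac, double testFrac, long seed) {
        List<T> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed makes the split reproducible
        int n = shuffled.size();
        int trainEnd = Math.min(n, (int) Math.round(n * trainFrac));
        int testEnd = Math.min(n, trainEnd + (int) Math.round(n * testFrac));

        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, trainEnd)));
        parts.add(new ArrayList<>(shuffled.subList(trainEnd, testEnd)));
        parts.add(new ArrayList<>(shuffled.subList(testEnd, n)));
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            rows.add(i);
        }
        List<List<Integer>> parts = split(rows, 0.7, 0.15, 42L);
        System.out.println("train=" + parts.get(0).size()
                + " test=" + parts.get(1).size()
                + " eval=" + parts.get(2).size());
    }
}
```

The point is only that the three parts are disjoint and that the evaluation part is never touched while tuning.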

Stop words

If your data are articles from newspapers and such, I personally haven't seen any significant improvement from using more sophisticated stop-word removal approaches.

That's just my personal experience, but if I were you, I wouldn't focus on that step.

Term Frequency

How about using Term Frequency-Inverse Document Frequency (TF-IDF) term weighting instead? You may want to read: How can I create a TF-IDF for Text Classification using Spark?

I would try both and compare!
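If it helps to see what TF-IDF actually computes, here is a small plain-Java sketch. The class and method names are made up for illustration; the smoothed formula log((m + 1) / (df + 1)) is the one documented for Spark's IDF, and raw term counts are assumed for the TF part.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of TF-IDF weighting; not Spark code. */
final class TfIdfSketch {

    /** Document frequency: the number of documents that contain each term. */
    static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) { // count each term once per document
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Smoothed inverse document frequency: log((m + 1) / (df + 1)). */
    static double idf(int numDocs, int docFreq) {
        return Math.log((numDocs + 1.0) / (docFreq + 1.0));
    }

    /** TF-IDF weights for one document: raw term count multiplied by the term's IDF. */
    static Map<String, Double> tfIdf(List<String> doc, Map<String, Integer> df, int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (String term : doc) {
            // Adding the IDF once per occurrence yields count * idf.
            weights.merge(term, idf(numDocs, df.getOrDefault(term, 0)), Double::sum);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("murder", "in", "town"),
                Arrays.asList("fair", "in", "town"),
                Arrays.asList("theft", "in", "shop"));
        Map<String, Integer> df = documentFrequency(docs);
        System.out.println(tfIdf(docs.get(0), df, docs.size()));
    }
}
```

A word that appears in every document ("in" here) gets weight zero, while a word that appears in only one document keeps a high weight. That is the difference from plain term frequency.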

Multinomial

Do you have any particular reason to try the Multinomial Distribution? If not, note that when n is 1 and k is 2 the multinomial distribution is the Bernoulli distribution, as stated in Wikipedia, and the Bernoulli model is supported as well.

Try both and compare ( this is something you have to get used to, if you wish to make your model better! :) )

I also see that apache-spark-mllib offers Random forests, which might be worth a read, at least! ;)

If your data is not that big, I would also try Support vector machines (SVMs) from scikit-learn. It is a Python library, though, so you would have to switch to PySpark or to plain Python, abandoning Spark in the latter case. BTW, if you are actually going for sklearn, this might come in handy: How to split into train, test and evaluation sets in sklearn?, since Pandas plays nicely along with sklearn.

Hope this helps!

Off-topic:

This is really not the way to ask a question on Stack Overflow. Read How to ask a good question?

Personally, if I were you, I would do all the things you have done in your answer first, and then post a question, summarizing my approach.

As for the bounty, you may want to read: How does the Bounty System work?

Thanks gsamaras. I will follow the suggestions as you have mentioned — lives, Sep 17 '16 at 08:59
@lives great! I also updated my answer, since I am doing SVMs now, and I feel you pretty much! — gsamaras, Sep 17 '16 at 17:54

score 1 · Accepted Answer · edited Sep 11 '16 at 21:00

This is how I solved the above problem.

Though the prediction accuracy is not bad, the model has to be tuned further for better results.

Experts, please let me know if you find anything wrong.

My input data frame has two columns "Text" and "RiskClassification"

Below is the sequence of steps to predict using Naive Bayes in Java

Add a new column "label" to the input dataframe. This column encodes the risk classification as a number, like below

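import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;

// Map each risk class to a numeric label; any other value falls back to 0.
sqlContext.udf().register("myUDF", new UDF1<String, Integer>() {
    @Override
    public Integer call(String input) throws Exception {
        if ("LOW".equals(input))
            return 1;
        if ("MEDIUM".equals(input))
            return 2;
        if ("HIGH".equals(input))
            return 3;
        return 0;
    }
}, DataTypes.IntegerType);

// Apply the UDF to the risk classification column to produce the "label" column.
samplingData = samplingData.withColumn("label", functions.callUDF("myUDF", samplingData.col("riskClassification")));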

Create the Training (80%) and Testing Data Sets (20%)

For example:

DataFrame lowRisk = samplingData.filter(samplingData.col("label").equalTo(1));
DataFrame lowRiskTraining = lowRisk.sample(false, 0.8);

Union all the dataframes to build the complete training data
Building the test data is slightly tricky: each sample call draws its rows at random, so the test data has to be derived as all the rows that are not present in the training data
Start the transformation of the training data and build the model
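The split described in the steps above (sample about 80% of every risk class for training, then take everything else as test data) can be sketched in plain Java as follows. This is only an illustration on row indices: the class `StratifiedSplit` and its methods are hypothetical names, and the sketch takes an exact per-class count, whereas Spark's `sample` draws an approximate fraction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of a per-class split with a complementary test set. */
final class StratifiedSplit {

    /** Picks trainFrac of the row indices of every class for training. */
    static Set<Integer> trainingIndices(List<String> labels, double trainFrac, long seed) {
        Map<String, List<Integer>> rowsByClass = new LinkedHashMap<>();
        for (int i = 0; i < labels.size(); i++) {
            rowsByClass.computeIfAbsent(labels.get(i), k -> new ArrayList<>()).add(i);
        }
        Random random = new Random(seed);
        Set<Integer> training = new TreeSet<>();
        for (List<Integer> rows : rowsByClass.values()) {
            Collections.shuffle(rows, random);
            int take = (int) Math.round(rows.size() * trainFrac);
            training.addAll(rows.subList(0, take)); // the "union" of the per-class samples
        }
        return training;
    }

    /** The test set is every row index that was not chosen for training. */
    static Set<Integer> testIndices(int rowCount, Set<Integer> training) {
        Set<Integer> test = new TreeSet<>();
        for (int i = 0; i < rowCount; i++) {
            if (!training.contains(i)) {
                test.add(i);
            }
        }
        return test;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("LOW", "LOW", "LOW", "LOW", "LOW",
                "HIGH", "HIGH", "HIGH", "HIGH", "HIGH");
        Set<Integer> training = trainingIndices(labels, 0.8, 7L);
        System.out.println("training rows: " + training);
        System.out.println("test rows: " + testIndices(labels.size(), training));
    }
}
```

Splitting per class keeps all three risk levels represented in both sets, which matters when one class (say "HIGH") is rare.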

Tokenize the text column in the training data set

Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
DataFrame tokenized = tokenizer.transform(trainingRiskData);

Remove stop words. (Here you can also do advanced operations like lemmatization, stemming, POS tagging, etc. using the Stanford NLP library.)

StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered");
DataFrame stopWordsRemoved = remover.transform(tokenized);
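For readers who want to see what these two steps do to a single description, here is a rough plain-Java equivalent. It assumes lowercasing plus whitespace splitting (which is what Spark's Tokenizer does) and a tiny made-up stop-word list; the real StopWordsRemover ships a much longer default list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration of tokenizing and stop-word filtering; not Spark code. */
final class TokenizeSketch {

    // A tiny illustrative stop-word list, assumed for this example only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "in", "of"));

    /** Lowercases the text and splits it on runs of whitespace. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Keeps only the tokens that are not in the stop-word list. */
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords(tokenize("A murder in town")));
    }
}
```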

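Compute Term Frequency using HashingTF (CountVectorizer is another way to do this), then weight it with IDF

// Note: 20 hash buckets is very small, so many distinct words will share a feature; HashingTF defaults to 2^18 (262144).
int numFeatures = 20;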
HashingTF hashingTF = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures")
        .setNumFeatures(numFeatures);
DataFrame rawFeaturizedData = hashingTF.transform(stopWordsRemoved);

IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
IDFModel idfModel = idf.fit(rawFeaturizedData);

DataFrame featurizedData = idfModel.transform(rawFeaturizedData);
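To show why the number of features matters here, below is a minimal sketch of the hashing trick in plain Java. It uses `String.hashCode` as the hash function purely for illustration (Spark uses its own hash), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of the hashing trick behind a hashed term-frequency vector. */
final class HashingSketch {

    /** Maps a term to a bucket in [0, numFeatures); distinct terms may share a bucket. */
    static int bucket(String term, int numFeatures) {
        return Math.floorMod(term.hashCode(), numFeatures);
    }

    /** Builds a fixed-length term-frequency vector by counting terms per bucket. */
    static double[] termFrequencies(List<String> terms, int numFeatures) {
        double[] tf = new double[numFeatures];
        for (String term : terms) {
            tf[bucket(term, numFeatures)] += 1.0;
        }
        return tf;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("murder", "town", "murder");
        System.out.println(Arrays.toString(termFrequencies(terms, 20)));
    }
}
```

With only 20 buckets, any vocabulary of more than 20 distinct words must produce collisions, so unrelated words end up sharing a feature. A larger `numFeatures` makes that much less likely, at the cost of longer (but sparse) vectors.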

Convert the featurized input into a JavaRDD, because the MLlib Naive Bayes works on LabeledPoint

JavaRDD<LabeledPoint> labelledJavaRDD = featurizedData.select("label", "features").toJavaRDD()
    .map(new Function<Row, LabeledPoint>() {
        @Override
        public LabeledPoint call(Row row) throws Exception {
            // Column 0 is the numeric label, column 1 is the TF-IDF feature vector.
            return new LabeledPoint(Double.parseDouble(row.get(0).toString()),
                    (org.apache.spark.mllib.linalg.Vector) row.get(1));
        }
    });

Build the model

// train is a static method, so no NaiveBayes instance is needed; 1.0 is the smoothing parameter (lambda).
NaiveBayesModel naiveBayesModel = NaiveBayes.train(labelledJavaRDD.rdd(), 1.0, "multinomial");
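In case it is useful to see what the multinomial model does with the smoothing value, here is a compact plain-Java sketch of multinomial Naive Bayes with additive smoothing. It is not Spark's implementation: the class name is made up, and it assumes dense feature arrays and labels numbered 0 to numClasses - 1 (unlike the 1/2/3 labels used above).

```java
/** Hypothetical, simplified multinomial Naive Bayes with additive (Laplace) smoothing. */
final class MultinomialNbSketch {
    final double[] logPrior;        // log P(class)
    final double[][] logLikelihood; // log P(feature | class)

    MultinomialNbSketch(double[][] features, int[] labels, int numClasses, double lambda) {
        int numFeatures = features[0].length;
        double[] classCount = new double[numClasses];
        double[][] featureSum = new double[numClasses][numFeatures];
        for (int i = 0; i < features.length; i++) {
            classCount[labels[i]] += 1.0;
            for (int j = 0; j < numFeatures; j++) {
                featureSum[labels[i]][j] += features[i][j];
            }
        }
        logPrior = new double[numClasses];
        logLikelihood = new double[numClasses][numFeatures];
        for (int c = 0; c < numClasses; c++) {
            logPrior[c] = Math.log((classCount[c] + lambda) / (features.length + numClasses * lambda));
            double total = 0.0;
            for (double v : featureSum[c]) {
                total += v;
            }
            for (int j = 0; j < numFeatures; j++) {
                // Smoothing keeps features unseen in a class from getting probability zero.
                logLikelihood[c][j] = Math.log((featureSum[c][j] + lambda) / (total + numFeatures * lambda));
            }
        }
    }

    /** Returns the class with the highest log posterior for the given feature vector. */
    int predict(double[] x) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < logPrior.length; c++) {
            double score = logPrior[c];
            for (int j = 0; j < x.length; j++) {
                score += x[j] * logLikelihood[c][j];
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] features = {{2, 0, 0}, {1, 0, 0}, {0, 2, 0}, {0, 1, 0}, {0, 0, 2}, {0, 0, 1}};
        int[] labels = {0, 0, 1, 1, 2, 2};
        MultinomialNbSketch model = new MultinomialNbSketch(features, labels, 3, 1.0);
        System.out.println(model.predict(new double[] {0, 1, 0}));
    }
}
```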

Run all the above transformations on the test data as well, reusing the IDFModel fitted on the training data instead of fitting a new one
Loop through the test data frame and perform the actions below
Create a LabeledPoint using the "label" and "features" in the test data frame

For example, if the test data frame has the label and features at column indices 3 and 7 (Row.get is zero-based), then

LabeledPoint labeledPoint = new LabeledPoint(new Double(dataFrameRow.get(3).toString()),
(org.apache.spark.mllib.linalg.Vector) dataFrameRow.get(7));

Use the Prediction Model to predict the label

double predictedLabel = naiveBayesModel.predict(labeledPoint.features());

Add the predicted label as a column to the test data frame as well.
Now the test data frame has both the expected label and the predicted label.
You can export the test data to CSV and do the analysis there, or you can compute the accuracy programmatically as well.
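Computing the accuracy programmatically can be as simple as the sketch below. It is plain Java on arrays of expected and predicted labels, assumes labels numbered from 0, and uses made-up class and method names. The confusion matrix is included because overall accuracy alone can hide a class that is always mispredicted.

```java
/** Hypothetical evaluation helpers for expected vs. predicted labels. */
final class EvaluationSketch {

    /** Fraction of rows whose predicted label equals the expected label. */
    static double accuracy(int[] expected, int[] predicted) {
        int correct = 0;
        for (int i = 0; i < expected.length; i++) {
            if (expected[i] == predicted[i]) {
                correct++;
            }
        }
        return (double) correct / expected.length;
    }

    /** matrix[e][p] counts rows with expected label e that were predicted as p. */
    static int[][] confusionMatrix(int[] expected, int[] predicted, int numClasses) {
        int[][] matrix = new int[numClasses][numClasses];
        for (int i = 0; i < expected.length; i++) {
            matrix[expected[i]][predicted[i]]++;
        }
        return matrix;
    }

    public static void main(String[] args) {
        int[] expected = {0, 1, 2, 2};
        int[] predicted = {0, 1, 1, 2};
        System.out.println("accuracy = " + accuracy(expected, predicted));
        System.out.println(java.util.Arrays.deepToString(confusionMatrix(expected, predicted, 3)));
    }
}
```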
