How to train Logistic Regression with Apache Spark ML using two text columns as features?

Question

I am trying to train a logistic regression model with Apache Spark. My DataFrame looks like this:

StructType schema = new StructType(new StructField[]{
    new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
    new StructField("text_A", DataTypes.StringType, false, Metadata.empty()),
    new StructField("text_B", DataTypes.StringType, false, Metadata.empty()),
});

Dataset<Row> trainingDataFrame = spark.createDataFrame(trainingdata, schema);

I want to use both text_A and text_B as features to train the model, but I don't want to simply concatenate them. I want them to be separate categories of features, so that if the same word shows up in both text_A and text_B, the two occurrences are treated as different features. The lr (LogisticRegression) class trains on a single features column, named features by default. Is it possible to use two different columns as training features? If not, how can I merge these two text features into a single features column for training?

score 0 · Answer 1 · answered Oct 01 '18 at 21:25

After searching online, I found the question "How to merge multiple feature vectors in DataFrame?", which seems to answer my question.
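For completeness, here is a minimal sketch of how merging feature vectors could look for the schema in the question. It is only an illustration under stated assumptions, not necessarily what the linked answer does: each text column is tokenized and hashed separately, and VectorAssembler then concatenates the two vectors into one features column. The intermediate column names (words_A, tf_A, and so on), the class name, the hash size, and the iteration count are all hypothetical choices.

```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Hypothetical helper class; the name is an assumption for this sketch.
public class TwoTextColumnsLr {

    // Assumed hash size per text column (arbitrary; tune for your vocabulary).
    private static final int NUM_FEATURES = 1 << 12;

    // Builds an unfitted pipeline that featurizes text_A and text_B separately
    // and joins the results into a single "features" column.
    public static Pipeline buildPipeline() {
        // Split each text column into words independently.
        Tokenizer tokA = new Tokenizer().setInputCol("text_A").setOutputCol("words_A");
        Tokenizer tokB = new Tokenizer().setInputCol("text_B").setOutputCol("words_B");

        // Hash each word list into its own term-frequency vector.
        HashingTF tfA = new HashingTF()
                .setInputCol("words_A").setOutputCol("tf_A").setNumFeatures(NUM_FEATURES);
        HashingTF tfB = new HashingTF()
                .setInputCol("words_B").setOutputCol("tf_B").setNumFeatures(NUM_FEATURES);

        // Concatenate the two vectors: tf_A fills the first NUM_FEATURES slots,
        // tf_B the next NUM_FEATURES, so the same word gets a different index
        // (and therefore a different coefficient) depending on its column.
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"tf_A", "tf_B"})
                .setOutputCol("features");

        LogisticRegression lr = new LogisticRegression()
                .setMaxIter(20) // assumed value
                .setLabelCol("label")
                .setFeaturesCol("features");

        return new Pipeline().setStages(
                new PipelineStage[]{tokA, tokB, tfA, tfB, assembler, lr});
    }

    // Fits the pipeline on a DataFrame with the label/text_A/text_B schema.
    public static PipelineModel train(Dataset<Row> trainingDataFrame) {
        return buildPipeline().fit(trainingDataFrame);
    }
}
```

Calling TwoTextColumnsLr.train(trainingDataFrame) should then return a fitted PipelineModel whose transform method applies the same featurization to new rows. Because the two hashed vectors occupy disjoint index ranges in the assembled vector, a word appearing in text_A and in text_B is treated as two different features, which is the behavior asked for in the question.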

JLTChiu
