I am trying to train a logistic regression model with Apache Spark. My DataFrame is defined like this:
    StructType schema = new StructType(new StructField[]{
        new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
        new StructField("text_A", DataTypes.StringType, false, Metadata.empty()),
        new StructField("text_B", DataTypes.StringType, false, Metadata.empty())
    });
    Dataset<Row> trainingDataFrame = spark.createDataFrame(trainingdata, schema);
I want to use both text_A and text_B as features to train the model, but I don't want to simply concatenate them. I want them to be separate categories of features, so that if the same word shows up in both text_A and text_B, the two occurrences are treated as different features. Currently, the lr class uses the features column as its default, and only, input for training. Is it possible to use two different columns as the training features? If not, how can I merge these two text features into a single features column for training?
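To make the goal concrete, here is a minimal plain-Java sketch (no Spark) of the feature layout I have in mind. Everything in it is hypothetical and only for illustration: the class name SeparateFeatureSpaces, the helper names, the bucket count, and the use of String.hashCode as a stand-in hash. Words from text_A are hashed into the first half of the vector and words from text_B into the second half, so the same word in both columns ends up at two different indices.

```java
import java.util.Arrays;

public class SeparateFeatureSpaces {

    // Hash each whitespace-separated token of `text` into one of `numBuckets`
    // slots and add its count to `out`, starting at index `offset`.
    // String.hashCode is only a stand-in hash for this illustration.
    static void addTermCounts(String text, int offset, int numBuckets, double[] out) {
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty input string yields one empty token
            }
            int bucket = Math.floorMod(token.hashCode(), numBuckets);
            out[offset + bucket] += 1.0;
        }
    }

    // text_A occupies indices [0, numBuckets);
    // text_B occupies indices [numBuckets, 2 * numBuckets).
    static double[] featurize(String textA, String textB, int numBuckets) {
        double[] features = new double[2 * numBuckets];
        addTermCounts(textA, 0, numBuckets, features);
        addTermCounts(textB, numBuckets, numBuckets, features);
        return features;
    }

    public static void main(String[] args) {
        // "spark" appears in both inputs but is counted in two separate slots.
        double[] v = featurize("spark is fast", "spark", 8);
        System.out.println(Arrays.toString(v));
    }
}
```

In Spark ML terms, my guess is that this corresponds to running a separate Tokenizer and HashingTF on each text column and then joining the two resulting vector columns into one features column with VectorAssembler, but that is an assumption on my part and I have not verified it.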