
I have a LabeledPoint on which I want to run logistic regression:

Data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = 
MapPartitionsRDD[3335] at map at <console>:44

using code:

val splits = Data.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)

val numIterations = 100
val model = LogisticRegressionWithSGD.train(training, numIterations)

My problem is that I don't want to use all of the features from the LabeledPoint, but only some of them. I've got a list of features that I want to use, for example:

LoF=List(223244,334453...

How can I get only the features that I want to use from the LabeledPoint, or select them in logistic regression?

Maju116

1 Answer


Feature selection allows selecting the most relevant features for use in model construction. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors. The number of features to select can be tuned using a held-out validation set.

One way to do what you are seeking is using the ElementwiseProduct.

ElementwiseProduct multiplies each input vector by a provided "weight" vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This is the Hadamard product between the input vector, v, and the transforming vector, w, yielding a result vector.

So if we set the weights of the features we want to keep to 1.0 and the others to 0.0, the ElementwiseProduct of the original vector and the 0-1 weight vector yields exactly the features we need:

import org.apache.spark.mllib.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import sqlContext.implicits._ // needed for toDF on an RDD

// Creating dummy LabeledPoint RDD
val data = sc.parallelize(Array(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0, 5.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(4.0, 5.0, 6.0, 1.0, 2.0)),
  LabeledPoint(0.0, Vectors.dense(4.0, 2.0, 3.0, 0.0, 2.0))))

data.toDF.show

// +-----+--------------------+
// |label|            features|
// +-----+--------------------+
// |  1.0|[1.0,0.0,3.0,5.0,...|
// |  1.0|[4.0,5.0,6.0,1.0,...|
// |  0.0|[4.0,2.0,3.0,0.0,...|
// +-----+--------------------+

// You'll need to know how many features you have; I have used 5 for this example
val numFeatures = 5

// The indices represent the features we want to keep
// Note: indices start at 0, so here you are actually keeping features 4 and 5 (1-based)
val indices = List(3, 4).toArray

// Now we can create our weights vectors
val weights = Array.fill[Double](indices.size)(1)

// Create the sparse vector of the features we need to keep.
val transformingVector = Vectors.sparse(numFeatures, indices, weights)

// Init our vector transformer
val transformer = new ElementwiseProduct(transformingVector)

// Apply it to the data.
val transformedData = data.map(x => LabeledPoint(x.label,transformer.transform(x.features).toSparse))

transformedData.toDF.show

// +-----+-------------------+
// |label|           features|
// +-----+-------------------+
// |  1.0|(5,[3,4],[5.0,1.0])|
// |  1.0|(5,[3,4],[1.0,2.0])|
// |  0.0|      (5,[4],[2.0])|
// +-----+-------------------+

Note:

  • I used the sparse vector representation for space optimization.
  • The resulting features are sparse vectors.
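With the features filtered, training proceeds exactly as in the question's own snippet, just on `transformedData` instead of the full data (a sketch reusing the question's split and iteration values):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// Split the reduced-feature data as in the question
val splits = transformedData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)

// Train logistic regression on the selected features only
val numIterations = 100
val model = LogisticRegressionWithSGD.train(training, numIterations)
```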
    @zero323 tell me what you think! :) – eliasah Nov 17 '15 at 15:11
  • After running your code I get: `scala> transformedData.collect() Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((1.0,(5,[3,4],[5.0,1.0])), (1.0,(5,[3,4],[1.0,2.0])), (0.0,(5,[4],[2.0])))` Didn't it keep features no. 4 and 5 instead of 3 and 4? – Maju116 Nov 17 '15 at 15:31
  • Ok that's normal, indices start with 0. – eliasah Nov 17 '15 at 15:38
  • My mistake, sorry. I've got one more question. My data is in `LabeledPoint` sparse representation, how can I change it to dense to run your code? – Maju116 Nov 17 '15 at 15:41
  • It's straightforward! The data just has to be an `RDD[LabeledPoint]` – eliasah Nov 17 '15 at 15:52
  • Thanks! It was very helpful! I've got one last question. If I keep only some of my features there's a possibility that I'll get some empty rows (every value for kept features will be 0). Should I remove those rows before running logistic regression and if yes how can I do it? – Maju116 Nov 17 '15 at 16:01
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/95367/discussion-between-eliasah-and-maju116). – eliasah Nov 17 '15 at 16:04
  • Thanks @zero323! I didn't find another way to do it! Is this how you would have done it? – eliasah Nov 17 '15 at 16:47
  • I've been thinking about UDF but `ElementwiseProduct` is much cleaner. It is a really neat solution :) BTW, could you take look at [this](http://stackoverflow.com/a/33757380/1560062)? Do you think I've missed some obvious solution? – zero323 Nov 17 '15 at 16:58
  • Thanks again @zero323. I'll take a look on it once I get home. – eliasah Nov 17 '15 at 18:15
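For the follow-up question in the comments about rows where every kept feature is 0, one possible approach (a hedged sketch, assuming the `transformedData` RDD from the answer) is to filter on the number of non-zero entries before training:

```scala
// Keep only rows whose feature vector has at least one non-zero entry
val nonEmptyData = transformedData.filter(_.features.numNonzeros > 0)
```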