
I have a LabeledPoint on which I want to run logistic regression:

Data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = 
MapPartitionsRDD[3335] at map at <console>:44

using code:

val splits = Data.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)

val numIterations = 100
val model = LogisticRegressionWithSGD.train(training, numIterations)

My problem is that I don't want to use all of the features from the LabeledPoint, but only some of them. I've got a list of features that I want to use, for example:

LoF=List(223244,334453...

How can I get only the features that I want to use from the LabeledPoint, or select them in logistic regression?

Maju116

1 Answer


Feature selection allows selecting the most relevant features for use in model construction. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors. The number of features to select can be tuned using a held-out validation set.

One way to do what you are seeking is using the ElementwiseProduct.

ElementwiseProduct multiplies each input vector by a provided "weight" vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This is the Hadamard product between the input vector, v, and the transforming vector, w, yielding a result vector.

So if we set the weights of the features we want to keep to 1.0 and the others to 0.0, the ElementwiseProduct of the original vector and the 0-1 weight vector yields exactly the features we need:

import org.apache.spark.mllib.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import sqlContext.implicits._ // needed for toDF on an RDD

// Creating dummy LabeledPoint RDD
val data = sc.parallelize(Array(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0, 5.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(4.0, 5.0, 6.0, 1.0, 2.0)),
  LabeledPoint(0.0, Vectors.dense(4.0, 2.0, 3.0, 0.0, 2.0))))

data.toDF.show

// +-----+--------------------+
// |label|            features|
// +-----+--------------------+
// |  1.0|[1.0,0.0,3.0,5.0,...|
// |  1.0|[4.0,5.0,6.0,1.0,...|
// |  0.0|[4.0,2.0,3.0,0.0,...|
// +-----+--------------------+

// You'll need to know how many features you have; I have used 5 for this example
val numFeatures = 5

// The indices represent the features we want to keep
// Note: indices start at 0, so here you are actually keeping features 4 and 5 (1-based)
val indices = List(3, 4).toArray

// Now we can create our weights vectors
val weights = Array.fill[Double](indices.size)(1)

// Create the sparse vector of the features we need to keep.
val transformingVector = Vectors.sparse(numFeatures, indices, weights)

// Init our vector transformer
val transformer = new ElementwiseProduct(transformingVector)

// Apply it to the data.
val transformedData = data.map(x => LabeledPoint(x.label,transformer.transform(x.features).toSparse))

transformedData.toDF.show

// +-----+-------------------+
// |label|           features|
// +-----+-------------------+
// |  1.0|(5,[3,4],[5.0,1.0])|
// |  1.0|(5,[3,4],[1.0,2.0])|
// |  0.0|      (5,[4],[2.0])|
// +-----+-------------------+

Note:

  • I used the sparse vector representation for space optimization.
  • The resulting features are sparse vectors.
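With the features filtered, training proceeds exactly as in the question's own snippet, just on `transformedData` instead of the full data (a sketch reusing the question's split and iteration values):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// Split the reduced-feature data as in the question
val splits = transformedData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)

// Train logistic regression on the selected features only
val numIterations = 100
val model = LogisticRegressionWithSGD.train(training, numIterations)
```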
    @zero323 tell me what you think! :) – eliasah Nov 17 '15 at 15:11
  • After running your code I get: `scala> transformedData.collect() Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((1.0,(5,[3,4],[5.0,1.0])), (1.0,(5,[3,4],[1.0,2.0])), (0.0,(5,[4],[2.0])))` Didn't it keep features no. 4 and 5 instead of 3 and 4? – Maju116 Nov 17 '15 at 15:31
  • Ok that's normal, indices start with 0. – eliasah Nov 17 '15 at 15:38
  • My mistake, sorry. I've got one more question. My data is in `LabeledPoint` sparse representation, how can I change it to dense to run your code? – Maju116 Nov 17 '15 at 15:41
  • It's straightforward! The data just has to be an `RDD[LabeledPoint]` – eliasah Nov 17 '15 at 15:52
  • Thanks! It was very helpful! I've got one last question. If I keep only some of my features there's a possibility that I'll get some empty rows (every value for kept features will be 0). Should I remove those rows before running logistic regression and if yes how can I do it? – Maju116 Nov 17 '15 at 16:01
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/95367/discussion-between-eliasah-and-maju116). – eliasah Nov 17 '15 at 16:04
  • Thanks @zero323! I didn't find another way to do it! Is this how you would have done it? – eliasah Nov 17 '15 at 16:47
  • I've been thinking about UDF but `ElementwiseProduct` is much cleaner. It is a really neat solution :) BTW, could you take look at [this](http://stackoverflow.com/a/33757380/1560062)? Do you think I've missed some obvious solution? – zero323 Nov 17 '15 at 16:58
  • Thanks again @zero323. I'll take a look on it once I get home. – eliasah Nov 17 '15 at 18:15
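For the follow-up question in the comments about rows where every kept feature is 0, one possible approach (a hedged sketch, assuming the `transformedData` RDD from the answer) is to filter on the number of non-zero entries before training:

```scala
// Keep only rows whose feature vector has at least one non-zero entry
val nonEmptyData = transformedData.filter(_.features.numNonzeros > 0)
```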