2

I am trying to build a Logistic Regression model with Apache Spark. Here is the code.

parsedData = raw_data.map(mapper) # mapper is a function that generates pair of label and feature vector as LabeledPoint object
featureVectors = parsedData.map(lambda point: point.features) # get feature vectors from parsed data 
scaler = StandardScaler(True, True).fit(featureVectors) #this creates a standardization model to scale the features
scaledData = parsedData.map(lambda lp: LabeledPoint(lp.label, scaler.transform(lp.features))) #trasform the features to scale mean to zero and unit std deviation
modelScaledSGD = LogisticRegressionWithSGD.train(scaledData, iterations = 10)

But I get this error:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

I am not sure how to work around this. Any help would be greately appreciated.

Community
  • 1
  • 1
ashishsjsu
  • 365
  • 1
  • 2
  • 9

1 Answers1

3

Problem you see is pretty much the same as the one I've described in How to use Java/Scala function from an action or a transformation? To transform you have to call Scala function, and it requires access to the SparkContext hence the error you see.

Standard way to handle this is to process only the required part of your data and then zip the results.

labels = parsedData.map(lambda point: point.label)
featuresTransformed = scaler.transform(featureVectors)

scaledData = (labels
    .zip(featuresTransformed)
    .map(lambda p: LabeledPoint(p[0], p[1])))

modelScaledSGD = LogisticRegressionWithSGD.train(...)

If don't plan to implement your own methods based on MLlib components it could easier to use high level ML API.

Edit:

There are two possible problems here.

  1. At this point LogisticRegressionWithSGD supports only binomial classification (Thanks to eliasah for pointing that out) . If you need multi-label classification you can replace it with LogisticRegressionWithLBFGS.
  2. StandardScaler supports only dense vectors so it has limited applications.
Community
  • 1
  • 1
zero323
  • 322,348
  • 103
  • 959
  • 935
  • it gives this [error](https://gist.github.com/eliasah/cc6287b4307123e5755a). I've never seen this error before. – eliasah Aug 25 '15 at 12:10
  • Works fine on 1.4.1. I'll download 1.3.1 later and check if I can reproduce the problem. `StandardScaler` won't work on sparse data but I doesn't look like it is the problem here. – zero323 Aug 25 '15 at 12:23
  • The solution sounds logical and correct to me and that's why I was surprised by the error. – eliasah Aug 25 '15 at 12:26
  • I think I know what is going on. You're using millionsong data but `LogisticRegressionWithSGD` is expecting binary classification problem. Could you checks logs for [this message](https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/mllib/src/main/scala/org/apache/spark/mllib/util/DataValidators.scala#L40)? – zero323 Aug 25 '15 at 12:38
  • Spark seems to be complaining about the input but that seems to be it. – eliasah Aug 25 '15 at 12:45
  • According to the [official documentation](http://spark.apache.org/docs/latest/mllib-classification-regression.html) LogisticRegression* supports Multiclass Classification. Is there a JIRA ticket concerning this or I have skipped something. – eliasah Aug 25 '15 at 12:52
  • As far as I can tell there is only place in GLM code where `Input validation failed` can be generated and it is from [here](https://github.com/apache/spark/blob/f5b028ed2f1ad6de43c8b50ebf480e1b6c047035/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L247). LR is using [`binaryLabelValidator`](https://github.com/apache/spark/blob/f5b028ed2f1ad6de43c8b50ebf480e1b6c047035/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala#L224) – zero323 Aug 25 '15 at 12:53
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/87888/discussion-between-eliasah-and-zero323). – eliasah Aug 25 '15 at 12:55
  • Thank you @zero323 for the answer. It worked perfectly. – ashishsjsu Aug 25 '15 at 21:44
  • @eliasah I used validateData=False option with LogisticRegressionWithLBFGS.train method and I don't get 'Input validation error' anymore. I hope it helps you too. – ashishsjsu Aug 25 '15 at 21:46
  • @ashishsjsu There no need for that. `LogisticRegressionWithLBFGS` has [numClasses](https://github.com/apache/spark/blob/45281664e0d3b22cd63660ca8ad6dd574f10e21f/python/pyspark/mllib/classification.py#L323) argument. – zero323 Aug 25 '15 at 21:52
  • I did use numClasses argument, but I was still getting the error. – ashishsjsu Aug 25 '15 at 23:06