
I have an RDD of LabeledPoints in Spark, and I want to count all the distinct values of the labels. I tried:

from pyspark.mllib.regression import LabeledPoint

train_data = sc.parallelize([LabeledPoint(1.0, [1.0, 0.0, 3.0]),
                             LabeledPoint(2.0, [1.0, 0.0, 3.0]),
                             LabeledPoint(1.0, [1.0, 0.0, 3.0])])

train_data.reduceByKey(lambda x: x.label).collect()

But I get:

TypeError: 'LabeledPoint' object is not iterable

I'm using Spark 2.1 and Python 2.7. Thanks for any help.


1 Answer


Your reduceByKey attempt fails because reduceByKey expects an RDD of (key, value) pairs (and a two-argument merge function); when Spark tries to unpack a LabeledPoint as such a pair, it raises the TypeError you see. You just have to convert your RDD of LabeledPoints into a key-value RDD and then count by key:

spark.version
# u'2.1.1'

from pyspark.mllib.regression import LabeledPoint

train_data = sc.parallelize([LabeledPoint(1.0, [1.0, 0.0, 3.0]),
                             LabeledPoint(2.0, [1.0, 0.0, 3.0]),
                             LabeledPoint(1.0, [1.0, 0.0, 3.0])])

dd = train_data.map(lambda x: (x.label, x.features)).countByKey()
dd
# {1.0: 2, 2.0: 1}    
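Since the question asks for the distinct label values, two close variants may also be worth noting (a minimal sketch, reusing the same train_data as above; map, countByValue, distinct, and count are all standard RDD methods):

# count occurrences of each label without building (key, value) pairs
train_data.map(lambda x: x.label).countByValue()
# {1.0: 2, 2.0: 1}  (returned as a defaultdict)

# or just the number of distinct labels
train_data.map(lambda x: x.label).distinct().count()
# 2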
  • This solves the OP's problem, but it would be nice to have an explanation :) – eliasah Oct 26 '17 at 07:57
  • Here you go: https://stackoverflow.com/questions/24804619/how-does-spark-aggregate-function-aggregatebykey-work and https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html; you can add those links as extra reading :) – eliasah Oct 26 '17 at 07:59
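For completeness, the reduceByKey pattern those links discuss would look roughly like this for this problem (a sketch, not from the original answer): map each point to a (label, 1) pair and sum the ones per key.

from operator import add

# sum a 1 for each occurrence of a label, then bring the results to the driver
train_data.map(lambda x: (x.label, 1)).reduceByKey(add).collect()
# [(1.0, 2), (2.0, 1)]  (order may vary)

Unlike countByKey, which returns a dict to the driver, reduceByKey keeps the result as an RDD until you collect it, so it scales better when there are many distinct keys.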