
I have an RDD of LabeledPoints in Spark, and I want to count all the distinct values of the labels. I tried:

from pyspark.mllib.regression import LabeledPoint

train_data = sc.parallelize([LabeledPoint(1.0, [1.0, 0.0, 3.0]),
                             LabeledPoint(2.0, [1.0, 0.0, 3.0]),
                             LabeledPoint(1.0, [1.0, 0.0, 3.0])])

train_data.reduceByKey(lambda x: x.label).collect()

But I get:

TypeError: 'LabeledPoint' object is not iterable

I'm using Spark 2.1 and Python 2.7. Thanks for any help.


1 Answer


Your reduceByKey attempt fails because reduceByKey expects an RDD of (key, value) pairs (and a two-argument merge function); when Spark tries to unpack a LabeledPoint as such a pair, it raises the TypeError you see. You just have to convert your RDD of LabeledPoints into a key-value RDD and then count by key:

spark.version
# u'2.1.1'

from pyspark.mllib.regression import LabeledPoint

train_data = sc.parallelize([LabeledPoint(1.0, [1.0, 0.0, 3.0]),
                             LabeledPoint(2.0, [1.0, 0.0, 3.0]),
                             LabeledPoint(1.0, [1.0, 0.0, 3.0])])

dd = train_data.map(lambda x: (x.label, x.features)).countByKey()
dd
# {1.0: 2, 2.0: 1}    
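Since the question asks for the distinct label values, two close variants may also be worth noting (a minimal sketch, reusing the same train_data as above; map, countByValue, distinct, and count are all standard RDD methods):

# count occurrences of each label without building (key, value) pairs
train_data.map(lambda x: x.label).countByValue()
# {1.0: 2, 2.0: 1}  (returned as a defaultdict)

# or just the number of distinct labels
train_data.map(lambda x: x.label).distinct().count()
# 2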
  • This solves the OP's problem, but it would be nice to have an explanation :) – eliasah Oct 26 '17 at 07:57
  • Here you go: https://stackoverflow.com/questions/24804619/how-does-spark-aggregate-function-aggregatebykey-work and https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html; you can add those links as extra reading :) – eliasah Oct 26 '17 at 07:59
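For completeness, the reduceByKey pattern those links discuss would look roughly like this for this problem (a sketch, not from the original answer): map each point to a (label, 1) pair and sum the ones per key.

from operator import add

# sum a 1 for each occurrence of a label, then bring the results to the driver
train_data.map(lambda x: (x.label, 1)).reduceByKey(add).collect()
# [(1.0, 2), (2.0, 1)]  (order may vary)

Unlike countByKey, which returns a dict to the driver, reduceByKey keeps the result as an RDD until you collect it, so it scales better when there are many distinct keys.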