
I was trying to use the pyspark.ml.evaluation binary classification metric (BinaryClassificationEvaluator) as below:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction")
print evaluator.evaluate(predictions)

My Predictions data frame looks like this:

predictions.select('rating', 'prediction').show()
+------+------------+
|rating|  prediction|
+------+------------+
|     1|  0.14829934|
|     1|-0.017862909|
|     1|   0.4951505|
|     1|0.0074382657|
|     1|-0.002562912|
|     1|   0.0208337|
|     1| 0.049362548|
|     1|  0.09693333|
|     1|  0.17998546|
|     1| 0.019649783|
|     1| 0.031353004|
|     1|  0.03657037|
|     1|  0.23280995|
|     1| 0.033190556|
|     1|  0.35569906|
|     1| 0.030974165|
|     1|   0.1422375|
|     1|  0.19786166|
|     1|  0.07740938|
|     1|  0.33970386|
+------+------------+
only showing top 20 rows

The datatype of each column is as follows:

predictions.printSchema()
root
 |-- rating: integer (nullable = true)
 |-- prediction: float (nullable = true)

Now I get an error from the ML code above saying the prediction column is FloatType but a VectorUDT was expected:

/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814 
    815         for temp_arg in temp_args:

/Users/i854319/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     51                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
     52             if s.startswith('java.lang.IllegalArgumentException: '):
---> 53                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     54             raise
     55     return deco

IllegalArgumentException: u'requirement failed: Column prediction must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually FloatType.'
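
For reference, the evaluator accepts a column of mllib vectors rather than plain floats. A toy sketch of an input it would accept, assuming the Spark 1.6-era pyspark.mllib.linalg types that the error message mentions (the frame and column names here are purely illustrative):

from pyspark.mllib.linalg import Vectors
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# toy frame: a double label plus a two-element vector of raw class scores
toy = sqlContext.createDataFrame(
    [(0.0, Vectors.dense([0.9, 0.1])),
     (1.0, Vectors.dense([0.2, 0.8]))],
    ["label", "rawPrediction"])

print BinaryClassificationEvaluator().evaluate(toy)  # rawPrediction is VectorUDT, so this runs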

So I thought of converting the prediction column from float to VectorUDT as below:

Applying a schema to the dataframe to convert the float column type to VectorUDT:

from pyspark.sql.types import IntegerType, StructType, StructField
from pyspark.mllib.linalg import VectorUDT

schema = StructType([
    StructField("rating", IntegerType, True),
    StructField("prediction", VectorUDT(), True)
])


predictions_dtype = sqlContext.createDataFrame(predictions, schema)

But now I get this error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-30-8fce6c4bbeb4> in <module>()
      4 
      5 schema = StructType([
----> 6     StructField("rating", IntegerType, True),
      7     StructField("prediction", VectorUDT(), True)
      8 ])

/Users/i854319/spark/python/pyspark/sql/types.pyc in __init__(self, name, dataType, nullable, metadata)
    401         False
    402         """
--> 403         assert isinstance(dataType, DataType), "dataType should be DataType"
    404         if not isinstance(name, str):
    405             name = name.encode('utf-8')

AssertionError: dataType should be DataType
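
As a side note, this particular AssertionError seems to come from passing the IntegerType class itself rather than an instance; StructField expects instantiated DataType objects, e.g.:

StructField("rating", IntegerType(), True)   # note the parentheses on IntegerType()

Even with that fixed, though, a schema on its own would not turn the float scores into vectors, which is what the evaluator is asking for.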

It takes so much time to run an ML algorithm in the Spark libraries, with so many weird errors. I even tried MLlib with RDD data, and that gives a ValueError: Null pointer exception.

Please advise.

Baktaawar

1 Answer


Try:

from pyspark.sql.functions import udf
from pyspark.mllib.linalg import DenseVector, VectorUDT

as_prob = udf(lambda x: DenseVector([1 - x, x]), VectorUDT())

df.withColumn("prediction", as_prob(df["prediction"]))

Source: Tuning parameters for implicit pyspark.ml ALS matrix factorization model through pyspark.ml CrossValidator
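
Applied to the question's DataFrame, the wiring might look like this (a sketch; predictions_vec is just a hypothetical name for the transformed frame):

predictions_vec = predictions.withColumn("prediction", as_prob(predictions["prediction"]))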

Community
  • Cool. This worked. Quick question though: pyspark.ml BinaryClassificationEvaluator should take the original label and predicted value as input. The original label should be int since it is 1 or 0. My original rating was int, but then it gave an error that it needs the label column to be double. I am not sure why that was required. Does it convert the label to float values too and then compare? – Baktaawar Nov 04 '16 at 22:02
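
Following up on the comment above: the evaluator appears to check that the label column is DoubleType, so one workaround is to cast the integer rating column before evaluating (a sketch, reusing the hypothetical predictions_vec frame from above):

# cast the integer rating label to double to satisfy the evaluator's label-type check
predictions_vec = predictions_vec.withColumn("rating", predictions_vec["rating"].cast("double"))

evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="rating")
print evaluator.evaluate(predictions_vec)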