
I have a DataFrame that looks like this:

+--------------------+------------------+
|            features|            labels|
+--------------------+------------------+
|[-0.38475, 0.568...]|            label1|
|[0.645734, 0.699...]|            label2|
|                 ...|               ...|
+--------------------+------------------+

Both columns are of string type (StringType()). I would like to fit this into Spark ML's random forest. To do so, I need to convert the features column into a vector of floats. Does anyone have an idea how to do this?

ABK

1 Answer

If you are using Spark 2.x, I believe this is what you need:

from pyspark.sql.functions import udf
from pyspark.mllib.linalg import Vectors      # Vectors.parse lives in mllib
from pyspark.ml.linalg import VectorUDT       # but the UDF must return an ml vector
from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame(
  [("[-0.38475, 0.568]", "label1"), ("[0.645734, 0.699]", "label2")],
  ("features", "label"))

def parse(s):
  # Parse the string into an mllib vector, then convert it to its ml
  # counterpart; rows that cannot be parsed become null.
  try:
    return Vectors.parse(s).asML()
  except Exception:
    return None

parse_ = udf(parse, VectorUDT())

parsed = df.withColumn("features", parse_("features"))

indexer = StringIndexer(inputCol="label", outputCol="label_indexed")

indexer.fit(parsed).transform(parsed).show()
## +----------------+------+-------------+
## |        features| label|label_indexed|
## +----------------+------+-------------+
## |[-0.38475,0.568]|label1|          0.0|
## |[0.645734,0.699]|label2|          1.0|
## +----------------+------+-------------+
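
Once features holds ml vectors and the label is indexed, fitting the random forest you asked about is straightforward. A minimal sketch with Spark 2.x; numTrees=10 is just a placeholder value:

from pyspark.ml.classification import RandomForestClassifier

indexed = indexer.fit(parsed).transform(parsed)

rf = RandomForestClassifier(featuresCol="features", labelCol="label_indexed", numTrees=10)
model = rf.fit(indexed)
model.transform(indexed).select("features", "label", "prediction").show()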

With Spark 1.6, it isn't much different:

from pyspark.sql.functions import udf
from pyspark.ml.feature import StringIndexer
from pyspark.mllib.linalg import Vectors, VectorUDT  # in 1.6 both come from mllib

df = sqlContext.createDataFrame(
  [("[-0.38475, 0.568]", "label1"), ("[0.645734, 0.699]", "label2")],
  ("features", "label"))

# No .asML() conversion is needed here, so Vectors.parse can be wrapped directly.
parse_ = udf(Vectors.parse, VectorUDT())

parsed = df.withColumn("features", parse_("features"))

indexer = StringIndexer(inputCol="label", outputCol="label_indexed")

indexer.fit(parsed).transform(parsed).show()
## +----------------+------+-------------+
## |        features| label|label_indexed|
## +----------------+------+-------------+
## |[-0.38475,0.568]|label1|          0.0|
## |[0.645734,0.699]|label2|          1.0|
## +----------------+------+-------------+

Vectors has a parse function that can help you achieve what you are trying to do.
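
To see what parse does on its own, here is a quick check on the driver, outside any DataFrame:

from pyspark.mllib.linalg import Vectors

v = Vectors.parse("[-0.38475, 0.568]")  # yields a DenseVector
v.asML()                                # Spark 2.x only: convert to a pyspark.ml vector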

eliasah
  • Thank you, but when I do this I get the following error: AttributeError: 'function' object has no attribute '_get_object_id' – ABK Jun 02 '17 at 11:46
  • With this exact code I get this error: TypeError: cannot serialize None of type. It seems we are not working with the same version of Spark: I replaced "from pyspark.mllib.linalg import Vectors / from pyspark.ml.linalg import VectorUDT" with "from pyspark.mllib.linalg import Vectors, VectorUDT", and spark.createDataFrame with sqlContext.createDataFrame, because they are not supported on my version. – ABK Jun 02 '17 at 12:06
  • This means that the parse(s) function is only returning None in my version. – ABK Jun 02 '17 at 12:09
  • What version are you using? And what does the second error you mentioned have to do with the first one? For the first one, you are probably using a reserved word as a column name, like df.count. – eliasah Jun 02 '17 at 12:12
  • I'm using Spark 1.6.2. The first error occurs when I try to apply the parse(s) function to my own DataFrame; the second is raised when I run the exact code you posted. – ABK Jun 02 '17 at 12:46
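
Not part of the original thread, but if Vectors.parse keeps returning None on 1.6.2, a manual parser that strips the brackets and splits on commas sidesteps it entirely. A minimal sketch, assuming every features string is a bracketed, comma-separated list of floats:

from pyspark.sql.functions import udf
from pyspark.mllib.linalg import Vectors, VectorUDT

def parse_manual(s):
  # Assumes every row looks like "[x, y, ...]"; strip the surrounding
  # brackets, split on commas, and build a DenseVector from the floats.
  return Vectors.dense([float(x) for x in s.strip("[]").split(",")])

parse_ = udf(parse_manual, VectorUDT())
parsed = df.withColumn("features", parse_("features"))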