
I have a DataFrame that looks like this:

+--------------------+------------------+
|            features|            labels|
+--------------------+------------------+
|[-0.38475, 0.568...]|            label1|
|[0.645734, 0.699...]|            label2|
|                 ...|               ...|
+--------------------+------------------+

Both columns are of string type (StringType()). I would like to fit this into Spark ML's random forest. To do so, I need to convert the features column into a vector of floats. Does anyone have an idea how to do this?

ABK

1 Answer

If you are using Spark 2.x, I believe this is what you need:

from pyspark.sql.functions import udf
from pyspark.mllib.linalg import Vectors      # Vectors.parse lives in mllib
from pyspark.ml.linalg import VectorUDT       # but the UDF must return an ml vector
from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame(
  [("[-0.38475, 0.568]", "label1"), ("[0.645734, 0.699]", "label2")],
  ("features", "label"))

def parse(s):
  # Parse the string into an mllib vector, then convert it to its ml
  # counterpart; rows that cannot be parsed become null.
  try:
    return Vectors.parse(s).asML()
  except Exception:
    return None

parse_ = udf(parse, VectorUDT())

parsed = df.withColumn("features", parse_("features"))

indexer = StringIndexer(inputCol="label", outputCol="label_indexed")

indexer.fit(parsed).transform(parsed).show()
## +----------------+------+-------------+
## |        features| label|label_indexed|
## +----------------+------+-------------+
## |[-0.38475,0.568]|label1|          0.0|
## |[0.645734,0.699]|label2|          1.0|
## +----------------+------+-------------+
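
Once features holds ml vectors and the label is indexed, fitting the random forest you asked about is straightforward. A minimal sketch with Spark 2.x; numTrees=10 is just a placeholder value:

from pyspark.ml.classification import RandomForestClassifier

indexed = indexer.fit(parsed).transform(parsed)

rf = RandomForestClassifier(featuresCol="features", labelCol="label_indexed", numTrees=10)
model = rf.fit(indexed)
model.transform(indexed).select("features", "label", "prediction").show()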

With Spark 1.6, it isn't much different:

from pyspark.sql.functions import udf
from pyspark.ml.feature import StringIndexer
from pyspark.mllib.linalg import Vectors, VectorUDT  # in 1.6 both come from mllib

df = sqlContext.createDataFrame(
  [("[-0.38475, 0.568]", "label1"), ("[0.645734, 0.699]", "label2")],
  ("features", "label"))

# No .asML() conversion is needed here, so Vectors.parse can be wrapped directly.
parse_ = udf(Vectors.parse, VectorUDT())

parsed = df.withColumn("features", parse_("features"))

indexer = StringIndexer(inputCol="label", outputCol="label_indexed")

indexer.fit(parsed).transform(parsed).show()
## +----------------+------+-------------+
## |        features| label|label_indexed|
## +----------------+------+-------------+
## |[-0.38475,0.568]|label1|          0.0|
## |[0.645734,0.699]|label2|          1.0|
## +----------------+------+-------------+

Vectors has a parse function that can help you achieve what you are trying to do.
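
To see what parse does on its own, here is a quick check on the driver, outside any DataFrame:

from pyspark.mllib.linalg import Vectors

v = Vectors.parse("[-0.38475, 0.568]")  # yields a DenseVector
v.asML()                                # Spark 2.x only: convert to a pyspark.ml vector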

eliasah
  • Thank you, but when I do this I get the following error: AttributeError: 'function' object has no attribute '_get_object_id' – ABK Jun 02 '17 at 11:46
  • With this exact code I get this error: TypeError: cannot serialize None of type. It seems we are not working with the same version of Spark: I replaced "from pyspark.mllib.linalg import Vectors / from pyspark.ml.linalg import VectorUDT" with "from pyspark.mllib.linalg import Vectors, VectorUDT", and spark.createDataFrame with sqlContext.createDataFrame, because they are not supported on my version. – ABK Jun 02 '17 at 12:06
  • This means that the parse(s) function is only returning None in my version. – ABK Jun 02 '17 at 12:09
  • What version are you using? And what does the second error you mentioned have to do with the first one? For the first one, you are probably using a reserved word as a column name, like df.count. – eliasah Jun 02 '17 at 12:12
  • I'm using Spark 1.6.2. The first error occurs when I try to apply the parse(s) function to my own DataFrame; the second is raised when I run the exact code you posted. – ABK Jun 02 '17 at 12:46
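
Not part of the original thread, but if Vectors.parse keeps returning None on 1.6.2, a manual parser that strips the brackets and splits on commas sidesteps it entirely. A minimal sketch, assuming every features string is a bracketed, comma-separated list of floats:

from pyspark.sql.functions import udf
from pyspark.mllib.linalg import Vectors, VectorUDT

def parse_manual(s):
  # Assumes every row looks like "[x, y, ...]"; strip the surrounding
  # brackets, split on commas, and build a DenseVector from the floats.
  return Vectors.dense([float(x) for x in s.strip("[]").split(",")])

parse_ = udf(parse_manual, VectorUDT())
parsed = df.withColumn("features", parse_("features"))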