My dataset has multiple columns, and I want to create a feature vector from a selected subset of them. I tried `VectorAssembler` from the `org.apache.spark.ml.feature` package, but because my dataset contains many null values, the `transform` method of `VectorAssembler` fails. Is there a substitute for `VectorAssembler`, or any other way to create dense vectors that can be passed to machine learning classification models?
-
This [question](https://stackoverflow.com/questions/32999099/handle-null-nan-values-in-spark-mllib-classifier) and this [answer](https://stackoverflow.com/a/41362543/6198942) state that `null` values are not meaningful in Spark classification algorithms. So I think your problem is more about how to handle those values than about how to substitute the `VectorAssembler`. – moe Oct 10 '17 at 20:18
-
Thanks for your response, Moe. Yes, you are right. I figured out that features cannot contain null values, so I removed all rows with null values in the feature columns. `VectorAssembler` only accepts numeric, boolean, and vector columns, and I had string values, so it kept failing even after I handled the nulls. I had to convert all my columns to double before `VectorAssembler` would work. – kanika Oct 11 '17 at 20:53
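For anyone landing here, the fix described in the comment above could look roughly like the sketch below. It is an illustration under assumptions, not the exact code that was used: the column names `x`, `y`, and `label`, the sample rows, and the helper `assembleFeatures` are all hypothetical, and it assumes the string columns hold numeric text.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FeatureVectorSketch {
  // Hypothetical helper: cast the selected columns to double, drop rows that
  // have a null in any of them, then assemble them into one vector column.
  def assembleFeatures(df: DataFrame, featureCols: Seq[String]): DataFrame = {
    // Assumption: string columns contain numeric text such as "1.0".
    // Truly categorical strings would need an encoder (e.g. StringIndexer) instead.
    val casted = featureCols.foldLeft(df) { (acc, c) =>
      acc.withColumn(c, col(c).cast("double"))
    }
    // Remove rows with a null (or NaN) in any of the feature columns.
    val cleaned = casted.na.drop(featureCols)
    new VectorAssembler()
      .setInputCols(featureCols.toArray)
      .setOutputCol("features")
      .transform(cleaned)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("feature-vector-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: "x" is numeric text, "y" is a nullable double.
    val raw = Seq(
      ("1.0", Some(2.0), "a"),
      ("3.5", None, "b"),
      ("7.0", Some(8.0), "c")
    ).toDF("x", "y", "label")

    // The row with the missing "y" is dropped before assembling.
    val assembled = assembleFeatures(raw, Seq("x", "y"))
    assembled.show(false)

    spark.stop()
  }
}
```

Dropping rows is only one way to handle nulls; filling them with `na.fill` or imputing a value may suit the data better. In Spark 2.4 and later, `VectorAssembler` also has a `handleInvalid` parameter (for example `setHandleInvalid("skip")`) that filters out rows with nulls during `transform`.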