I have saved a Spark DataFrame as a Parquet file using df.write.parquet("filePath").
I now want to apply a scikit-learn SVM classifier to this DataFrame's columns, since
a multiclass SVM is absent from the Spark ML/MLlib library.
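For context, the end goal is something like the following sketch; scikit-learn's SVC supports multiclass classification natively (one-vs-one by default), which is what is missing from Spark ML/MLlib. The toy data below is purely illustrative and stands in for the real columns:

```python
import numpy as np
from sklearn.svm import SVC

# Toy multiclass data (3 well-separated classes) standing in for the real features.
X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
y = np.array([0, 0, 1, 1, 2, 2])

# SVC handles more than two classes out of the box (one-vs-one internally).
clf = SVC(kernel="linear")
clf.fit(X, y)
pred = clf.predict([[1.05]])
```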
I tried reading the saved data back with spark.read.parquet('FilePath') and then
converting it to pandas with toPandas(). However, this conversion from a Spark
DataFrame to a pandas DataFrame takes too much time because the data is huge (81 GB).
Is it possible to read the saved Spark DataFrame directly into a pandas DataFrame?
If not, what will be the best way to proceed further?