I am using Python to train a Logistic Regression model and turned to MLlib for better performance.
I installed Spark and PySpark.
My data is stored in a numpy array, and I can easily convert it to a pandas DataFrame.
I tried to create a Spark DataFrame to feed the model, but creating the DataFrame is so slow that plain sklearn is faster overall.
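Roughly what I am doing now (a simplified sketch with small stand-in arrays; my real data is much larger):

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# stand-ins for my real arrays
encoded_data = np.random.rand(1000, 20)
y_train = np.random.randint(0, 2, 1000)

pdf = pd.DataFrame(encoded_data, columns=[f'f{i}' for i in range(encoded_data.shape[1])])
pdf['label'] = y_train

# this is the step that is too slow
sdf = spark.createDataFrame(pdf)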
I found that enabling Apache Arrow with this conf
('spark.sql.execution.arrow.enabled', 'true')
can make it faster, but it is still too slow and does not even utilize the cores (I checked my configuration: both the executor and the driver are set up with multiple cores, but they sit idle).
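In case it matters, this is roughly how I enable it when building the session (I believe the key was renamed to spark.sql.execution.arrow.pyspark.enabled in Spark 3.x, with the old one deprecated but still working):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config('spark.sql.execution.arrow.enabled', 'true')
    .getOrCreate()
)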
I tried using an RDD instead of a DataFrame with this code

from pyspark.mllib.regression import LabeledPoint

# lr is the DataFrame-based pyspark.ml.classification.LogisticRegression
d = [row.tolist() for row in encoded_data]
d = [LabeledPoint(label, row) for label, row in zip(y_train, d)]
rdd = spark.sparkContext.parallelize(d)
lr.fit(rdd)
But I keep getting this error
AttributeError: 'RDD' object has no attribute '_jdf'
I found this SO question regarding a similar issue, but it does not fit my case: my data does not come from a text file but from a numpy array. I could write the data to a file and then read it back in, but that makes no sense for my use case.
I would like to find a better way to use data from a numpy array. I have two arrays: encoded_data, which is an (n×m) array of features, and y_train, which is an (n×1) array of labels. I need to feed them to a Logistic Regression in order to improve my training times.
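For concreteness, dummy arrays with the same structure (sizes and values are made up):

import numpy as np

n, m = 100_000, 40
encoded_data = np.random.rand(n, m)        # dense numeric feature vectors
y_train = np.random.randint(0, 2, (n, 1))  # labels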
The data is dense for a reason: these are numeric feature vectors, not one-hot encodings. The reason I turned to Spark is to utilize local cores that sit idle during sklearn training.
Thanks.