
I am using Python to create a logistic regression model and turned to Spark MLlib for better performance.

I installed Spark and PySpark.

My data is stored in a NumPy array, and I can easily convert it into a pandas DataFrame.

I tried to create a Spark DataFrame to feed the model, but creating the DataFrame is too slow, and plain scikit-learn ends up being faster overall.

I found that enabling Arrow-based conversion with this config

('spark.sql.execution.arrow.enabled', 'true')

can make it faster, but it's still just too slow, and it doesn't even utilize the cores (I checked my configuration; both the executor and the driver are set up with multiple cores, but they are not being used).
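For reference, this is roughly how I build the session and convert the data (a sketch, not my exact code; encoded_data and y_train are the NumPy feature and label arrays I describe below):

from pyspark.sql import SparkSession
import pandas as pd

spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.sql.execution.arrow.enabled", "true")
         .getOrCreate())

# encoded_data: (n*m) NumPy feature array, y_train: (n*1) label array
pdf = pd.DataFrame(encoded_data)
pdf["label"] = y_train.ravel()
df = spark.createDataFrame(pdf)  # this conversion is the slow part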

I tried using an RDD instead of a DataFrame with this code:

d = [row.tolist() for row in encoded_data] 
d = [LabeledPoint(label, row) for label, row in zip(y_train, d)]
rdd = spark.parallelize(d)
lr.fit(rdd)

But I keep getting this error

AttributeError: 'RDD' object has no attribute '_jdf'

I found this SO question regarding a similar issue, but it does not fit my case: my data does not come from a text file but from a NumPy array. I could write the data to a file and then read it back, but that makes no sense in my use case.

I would like to find a better way of using data from a NumPy array. I have two arrays: encoded_data, an (n*m) array of features, and y_train, an (n*1) array of labels. I need to feed them to a logistic regression in order to improve my training times.

The data is dense for a reason: these are numeric feature vectors, not one-hot encodings. The reason I turned to Spark was to utilize local cores that sit idle during scikit-learn training.

Thanks.

thebeancounter

1 Answer


The source of the error is the use of incompatible APIs.

Spark provides two ML APIs:

  • The old pyspark.mllib, which is designed to work with RDDs
  • The new pyspark.ml, which is designed to work with DataFrames

Your lr object clearly belongs to the latter, while parallelize returns an RDD. See What's the difference between Spark ML and MLLIB packages, as suggested in the comments.
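For completeness, a minimal sketch of the pyspark.ml route (assuming encoded_data and y_train are the NumPy arrays described in the question; this is not a recommendation for your use case):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[*]").getOrCreate()

# pyspark.ml expects a DataFrame with a vector features column, not an RDD of LabeledPoint
rows = [(float(label), Vectors.dense(features.tolist()))
        for label, features in zip(y_train.ravel(), encoded_data)]
df = spark.createDataFrame(rows, ["label", "features"])

lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(df)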

Additionally, your whole premise is wrong. If your model can be easily trained on local data, on a single node, using standard Python libraries, then Spark ML has no chance to win here. Spark is about scaling your process to large datasets, not about reducing latency.

See Why is Apache-Spark - Python so slow locally as compared to pandas?

On top of that, using dense structures (which I assume is what you mean by NumPy arrays) to represent one-hot-encoded data is very inefficient and will significantly affect performance in general (Spark comes with its own Pipeline API, which, among other tools, provides a one-hot encoder yielding a sparse representation).
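For illustration only (the column names are made up, and this uses the Spark 3.x API, where the encoder is an estimator), a sparse one-hot encoding in the Pipeline API looks roughly like this:

from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml import Pipeline

# df is assumed to have a string column "color"
indexer = StringIndexer(inputCol="color", outputCol="color_idx")
encoder = OneHotEncoder(inputCols=["color_idx"], outputCols=["color_vec"])
pipeline = Pipeline(stages=[indexer, encoder])
encoded = pipeline.fit(df).transform(df)  # color_vec is a SparseVector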

Finally, parallelizing local collections is a testing and development tool, not a production solution.

desertnaut
  • Thanks for the reply. The thing is, my data is not one-hot encoded; these are numerical vectors... I am trying to use Spark locally in order to utilize all of the machine's cores, something that is not available in scikit-learn training. So multiprocessing is needed here, and the data is dense... – thebeancounter Jun 06 '19 at 11:24
  • Can you maybe offer some code example that should work? – thebeancounter Jun 06 '19 at 11:25