
I have written a script using Python 2.7 with PySpark on Spark 2.2. After computing some values, the algorithm saves them to a Cassandra database using the spark-cassandra-connector. The algorithm works fine when run standalone on Spark 2.2.

However, I need to run it on Spark 2.0.2 or Spark 2.1. My problem is that on those versions some collect operations (as well as `DataFrame.show()`) hang. I investigated, and the execution appears to block after DataFrame join operations. Do you have any suggestions (tuning, Spark UI checks, etc.)?

condition = [df_regressionLine['location_number'] == seasonalRatio['location_number'],
             df_regressionLine['location_type'] == seasonalRatio['location_type'],
             df_regressionLine['pag_code'] == seasonalRatio['pag_code'],
             df_regressionLine['PERIOD'] == seasonalRatio['period']]

freDataFrame = df_regressionLine.join(seasonalRatio, condition)
  • It is not even clear what you mean. What does it mean that the "operation is locked"? I'd recommend [mcve], or at least some diagnostics. And if locked means slow, then search the dozens of questions discussing data skew with `joins`. – zero323 Jan 20 '18 at 15:19
  • Ok, I'm sorry. I'm new to Spark/Python. However, I investigated and the problem seems to be the UDF functions. The algorithm is very slow and throws `Invalid PythonUDF filterBlendingRatio(blendedRatios#1440, INDEX#831), requires attributes from more than one child`. – Peppe Gallo Jan 20 '18 at 16:58
  • That's fine, but you should really work on an MCVE. In general it seems you have multiple problems, so I'd separate this into multiple questions. There is no udf in your question, the error you see is a known bug (https://stackoverflow.com/q/38491377/6910411), and yes, UDFs are slow and should be avoided. Vectorized ones are faster, but not available before 2.3. – zero323 Jan 20 '18 at 23:33

0 Answers