I have completed a script using Python 2.7, PySpark, and Spark 2.2. After computing some values, the algorithm saves them to a Cassandra database via the spark-cassandra-connector. The algorithm works fine in a stand-alone run.
However, I need to run it on Spark 2.0.2 or Spark 2.1. My problem is that some collect operations (and also DataFrame.show()) hang on Spark 2.1 and Spark 2.0.2. I investigated, and it seems the execution blocks after the DataFrame join operations. Do you have any suggestions for me (tuning, Spark UI checks, etc.)?
condition = [df_regressionLine['location_number'] == seasonalRatio['location_number'],
             df_regressionLine['location_type'] == seasonalRatio['location_type'],
             df_regressionLine['pag_code'] == seasonalRatio['pag_code'],
             df_regressionLine['PERIOD'] == seasonalRatio['period']]
freDataFrame = df_regressionLine.join(seasonalRatio, condition)
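For context, the actions that hang come right after this join. A rough sketch is below; the keyspace and table names are placeholders, not the real ones from my script:

# Illustrative only: this is where execution appears to block on Spark 2.0.2 / 2.1
# (it completes normally on Spark 2.2).
freDataFrame.show()

# Eventual save to Cassandra via the spark-cassandra-connector
# ("my_keyspace" / "my_table" are placeholder names):
(freDataFrame.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="my_table")
    .mode("append")
    .save())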