I have the following input array:
val bins = Array(("bin1", 1.0, 2.0), ("bin2", 3.0, 4.0), ("bin3", 5.0, 6.0))
The strings like "bin1" refer to values in a reference column on which the dataframe is filtered; for each matching subset, a new column is derived from another column by clamping it between the boundary values given by the two doubles in each triple.
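To make the transformation concrete, here is what the first triple ("bin1", 1.0, 2.0) should do to a toy dataframe (the column names "ref" and "value" here stand in for my real ones):

import org.apache.spark.sql.functions.{col, when}
import spark.implicits._

val toy = Seq(("bin1", 0.5), ("bin1", 1.5), ("bin1", 2.5)).toDF("ref", "value")
// clamp value into [1.0, 2.0]: 0.5 -> 1.0, 1.5 stays, 2.5 -> 2.0
toy.withColumn("value",
  when(col("value") < 1.0, 1.0)
    .when(col("value") > 2.0, 2.0)
    .otherwise(col("value")))
  .show()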
My actual loop looks like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, when}

var number_of_dataframes = bins.length
var ctempdf = spark.createDataFrame(sc.emptyRDD[Row], train_data.schema)
val t1 = System.nanoTime
for (x <- bins.indices) {
  // rows whose reference column matches this bin's label
  var tempdf = train_data.filter(col(refCol) === bins(x)._1)
  // clamp colName between the bin's lower and upper bounds
  tempdf = tempdf.withColumn(colName,
    when(col(colName) < bins(x)._2, bins(x)._2)
      .when(col(colName) > bins(x)._3, bins(x)._3)
      .otherwise(col(colName)))
  ctempdf = ctempdf.union(tempdf)
  // cumulative elapsed time in seconds
  val duration = (System.nanoTime - t1) / 1e9d
  println(duration)
}
The code above gets progressively slower with each additional bin. Is there a way I can speed this up drastically? This loop is itself nested inside another loop.
I have tried checkpoint / persist / cache, and none of them help.
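For reference, this is roughly how I applied them inside the loop (a sketch; the checkpoint directory path is a placeholder):

import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("/tmp/checkpoints")  // placeholder path, set once before the loop
ctempdf = ctempdf.union(tempdf).persist(StorageLevel.MEMORY_AND_DISK)
ctempdf = ctempdf.checkpoint()           // eager checkpoint, truncates the growing lineage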