I use Spark 2.1.1. I run many joins and selects on an input Dataset (inputDs) in a loop, one iteration per hour. It looks like this:
import java.time.Duration.ofHours
import org.apache.spark.sql.{Dataset, SQLImplicits}
import org.apache.spark.sql.functions.col
// Build one filtered Dataset per hour in [fromDate, toDate) and union them all.
val myDs = Iterator.iterate(fromDate)(_.plus(ofHours(1)))
  .takeWhile(_.isBefore(toDate))
  .map(next => getDsForOneHour(inputDs, next.getYear, next.getMonthValue, next.getDayOfMonth, next.getHour))
  .reduce(_.union(_))
def getDsForOneHour(ds: Dataset[I], year: Int, month: Int, day: Int, hour: Int)(implicit sql: SQLImplicits): Dataset[I] = {
  ds.where(col("year") === year and col("month") === month and col("day") === day and col("hour") === hour)
}
I run this code using spark-testing-base, and it takes about 3 minutes to complete the operations for one month (~30*24 unions and selects). These are all lazy operations, so I'm wondering why it takes Spark so long to build myDs?
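
To make the ~30*24 figure concrete, here is a minimal sketch in plain java.time (no Spark involved) that counts the hourly steps the iterator above would produce. The dates are made-up placeholders, and it assumes that fromDate and toDate are LocalDateTime values and that ofHours is java.time.Duration.ofHours; neither is stated in the question.

```scala
import java.time.{Duration, LocalDateTime}

object HourCount {
  // Hypothetical 30-day range (June 2017); the real fromDate/toDate are not given.
  val fromDate: LocalDateTime = LocalDateTime.of(2017, 6, 1, 0, 0)
  val toDate: LocalDateTime = LocalDateTime.of(2017, 7, 1, 0, 0)

  // Same iteration pattern as in the question, without the Spark calls.
  val hours: List[LocalDateTime] =
    Iterator.iterate(fromDate)(_.plus(Duration.ofHours(1)))
      .takeWhile(_.isBefore(toDate))
      .toList

  // One filtered Dataset per hour, so this many where(...) calls.
  val hourCount: Int = hours.size // 720

  // reduce(_.union(_)) over n elements performs n - 1 unions.
  val unionCount: Int = hourCount - 1 // 719
}
```

So for a 30-day month the resulting plan for myDs would contain 720 filters chained through 719 nested unions, all of which exist in the logical plan before any action runs.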