I'm having a pretty troubling problem with the LAST aggregate in Spark SQL on Spark 2.3.1. It gives me around 4 bad results -- that is, values that are not LAST by the specified partitioning and order -- per 500,000 (logical SQL, not Spark) partitions, roughly 50MM records. Proportionally, smaller batches are worse: the number of errors per batch stays roughly constant, although I don't think I tried anything smaller than 100,000 logical SQL partitions.
I have roughly 66 FIRST or LAST aggregates, a compound (date, integer) logical SQL partition key, and a compound (string, string) sort key. The sort-key strings are four-character numeric values, so I tried converting them to integers, and then combining the pair into a single integer. Neither change resolved the problem: even with a single integer sort key, I was still getting a few bad values.
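The key-flattening experiment looked roughly like this -- a sketch only; sortA and sortB are made-up stand-ins for my real four-character numeric sort fields:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// sortA and sortB are hypothetical names for the two four-character numeric
// sort fields; each parses to 0-9999, so multiplying the first by 10000 keeps
// the combined key's ordering identical to ordering by (sortA, sortB).
def flattenSortKey(df: DataFrame): DataFrame =
  df.withColumn("sort_key", col("sortA").cast("int") * 10000 + col("sortB").cast("int"))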
Typically, there are fewer than a hundred records in each partition, and only a handful of non-NULL values for any field. When the result is wrong, it is never the second-to-last value; it's always at least third-to-last.
I did try replacing the simple aggregate with a windowed aggregate using ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. The one run I did of that gave me six bad records -- the compound integer key had given me only two -- but I didn't do enough runs to really compare the approaches, and of course I need zero.
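That windowed variant was along these lines -- table and column names here are placeholders for my real ones, and the real query has one such aggregate per value column:

// Placeholder names: part_date/part_id are the (date, integer) partition key,
// sort_a/sort_b the (string, string) sort key, and some_value one of the
// sixty-odd value columns. Assumes a SparkSession named spark, as in the test below.
val windowedLast = spark.sql(
  """SELECT DISTINCT
    |  part_date, part_id,
    |  LAST(some_value) OVER (
    |    PARTITION BY part_date, part_id
    |    ORDER BY sort_a, sort_b
    |    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    |  ) AS some_value_last
    |FROM source_view""".stripMargin)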
Why can't I seem to rely on LAST()? Here's a test that exercises LAST, both unwindowed and windowed, although it uses a single partitioning field and a single sorting field where my real job uses two of each.
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.scalamock.scalatest.MockFactory
import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers}
import collection.JavaConverters._

class LastTest extends FlatSpec with Matchers with MockFactory with BeforeAndAfterAll {
  implicit val spark: SparkSession = SparkSession.builder().appName("Last Test").master("local[2]").getOrCreate()
  import spark.implicits._

  // TRN_DATE, TRN_NUMBER, TRN_TIMESTAMP, DETAILS, DATE_TIME, QUE_LINE_ID, OPR_INITIALS, ENTRY_TYPE, HIST_NO, SUB_HIST_NO, MSG_INFO
  "LAST" must "work with GROUP BY" in {
    val lastSchema = StructType(Seq(
      StructField("Pfield", IntegerType)   // partition field
      , StructField("Ofield", IntegerType) // order field
      , StructField("Vfield", StringType)  // value field
    ))

    val last: DataFrame = spark.createDataFrame(List[Row](
      Row(0, 1, "Pencil")
      , Row(5, 1, "Aardvark")
      , Row(10, 1, "Monastery")
      , Row(10, 2, "Remediation")
      , Row(15, 1, "Parcifal")
      , Row(20, 1, "Montenegro")
      , Row(20, 2, "Susquehana")
      , Row(20, 3, "Perfidy")
      , Row(20, 4, "Prosody")
    ).asJava
      , lastSchema
    ).repartition(expr("MOD(Pfield, 4)"))

    last.createOrReplaceTempView("last_group_test")

    // apply the unwindowed last
    val unwindowed: DataFrame = spark.sql("SELECT Pfield, LAST(Vfield) AS Vlast FROM (SELECT * FROM last_group_test ORDER BY Pfield, Ofield) GROUP BY Pfield ORDER BY Pfield")
    unwindowed.show(5)

    // apply a windowed last
    val windowed: DataFrame = spark.sql("SELECT DISTINCT Pfield, LAST(Vfield) OVER (PARTITION BY Pfield ORDER BY Ofield ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Vlast FROM last_group_test ORDER BY Pfield")
    windowed.show(5)

    // include the partitioning function in the window
    val excessivelyWindowed: DataFrame = spark.sql("SELECT DISTINCT Pfield, LAST(Vfield) OVER (PARTITION BY MOD(Pfield, 4), Pfield ORDER BY Ofield ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Vlast FROM last_group_test ORDER BY Pfield")
    excessivelyWindowed.show(5)

    assert(unwindowed.collect() === windowed.collect() && windowed.collect() === excessivelyWindowed.collect())
    assert(windowed.count() == 5)
    assert(windowed.filter("Pfield=20").select($"Vlast").collect()(0)(0) === "Prosody")
  }
}
So, all three datasets are the same, which is nice. But if I apply this logic to my actual needs -- sixty-odd columns, almost all of which are LAST values -- I get a wrong value roughly 4 times in a batch of 500,000 groups. If I run the same dataset 30 times, I get 30 different sets of bad records.
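For reference, the real aggregation is the SQL pattern from the test scaled up; in DataFrame terms it amounts to roughly this sketch, with placeholder key names and valueCols standing in for the sixty-odd real value columns:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.last

// part_date/part_id and sort_a/sort_b are placeholder key names; valueCols is
// a stand-in for the ~66 real value columns that each get a LAST aggregate.
def lastPerGroup(df: DataFrame, valueCols: Seq[String]): DataFrame = {
  val aggs = valueCols.map(c => last(c).as(s"${c}_last"))
  df.orderBy("part_date", "part_id", "sort_a", "sort_b")
    .groupBy("part_date", "part_id")
    .agg(aggs.head, aggs.tail: _*)
}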
Am I doing something wrong, or is this a defect? Is it a known defect? Is it fixed in 2.4? I didn't see one reported, but "aggregates simply don't work sometimes" can't be something they released with, right?