I'm reading a large number of CSVs from S3 (everything under a key prefix) and creating a strongly-typed `Dataset`:
```scala
import org.apache.spark.sql.functions.lit

val events: DataFrame = cdcFs.getStream()

val trades: Dataset[TradeRecord] = events
  .withColumn("event", lit("I"))
  .withColumn("source", lit(sourceName))
  .as[TradeRecord]
```
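(In case it matters for an answer: the `.as[TradeRecord]` cast itself is lazy, so the exception only surfaces once an action forces deserialization of the typed rows, e.g.:)

```scala
// The encoder deserializes rows lazily; the NPE fires at action time, not at .as[...]
trades.collect() // any action that materializes TradeRecord instances triggers it
```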
where `TradeRecord` is a case class that the `SparkSession` implicits can normally deserialize into. For a certain batch, however, a record is failing to deserialize. Here's the error (stack trace omitted):
```
Caused by: java.lang.NullPointerException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "deal")
- root class: "com.company.trades.TradeRecord"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
```
`deal` is a field of `TradeRecord` that should never be null in the source data (the S3 objects), so it's not an `Option`.
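For reference, here's roughly what the class looks like (only `deal`, `event`, and `source` appear in my code and the error above; the rest of the field list is a hypothetical sketch):

```scala
// Hypothetical sketch of the case class; only `deal` (scala.Long) is
// confirmed by the error message. scala.Long is a JVM primitive, so a
// null in that CSV column has nowhere to go, hence the NPE.
case class TradeRecord(
  deal: Long,
  event: String,
  source: String
  // ... further fields omitted
)
```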
Unfortunately the error message doesn't give me any clue as to what the CSV data looks like, or even which CSV file it's coming from. The batch consists of hundreds of files, so I need a way to narrow this down to at most a few files to investigate the issue.
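The closest thing I have to a plan is the sketch below: tag each row with `input_file_name()` while the data is still untyped, then inspect the rows where `deal` is null. I haven't verified this against my actual pipeline; `events` and `sourceName` are from my code above:

```scala
import org.apache.spark.sql.functions.{col, input_file_name, lit}

// Stay untyped so a null `deal` is just a null cell rather than an NPE,
// and record which S3 object each row came from.
val tagged = events
  .withColumn("event", lit("I"))
  .withColumn("source", lit(sourceName))
  .withColumn("_file", input_file_name())

// Files contributing rows with a null (or unparseable) `deal`.
tagged
  .filter(col("deal").isNull)
  .select("_file")
  .distinct()
  .show(truncate = false)

// NOTE: if `events` is a streaming DataFrame, this would need to run
// inside foreachBatch or against a static spark.read of the same prefix.
```

Is something like this reasonable, or is there a better way to trace the deserialization error back to the offending file(s)?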