
How should I properly perform datetime parsing with the Spark 2.0 Dataset API?

There are lots of samples for the DataFrame / RDD APIs.

A case class like

case class MyClass(myField: java.sql.Datetime)

combined with a read like

val mynewDf = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("charset", "UTF-8")
    .option("delimiter", ",")
    .csv("pathToFile.csv")
    .as[MyClass]

is not enough to cast the type. How should I perform this properly using the Dataset API?

Edit:

Loading the data works, e.g. printSchema shows myDateField: timestamp (nullable = true). But myDf.show results in a

java.lang.IllegalArgumentException
        at java.sql.Date.valueOf(Date.java:143)

which led me to believe that my parsing of the dates was incorrect. How can this be?

Georg Heiler

1 Answer


The correct representation of a timestamp is java.sql.Timestamp, so the class should be defined as

case class MyClass(myField: java.sql.Timestamp)

with the corresponding data:

myField
"2016-01-01 00:00:03"

If these conditions are satisfied, all you have to do is provide the schema:

import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

spark.read
  .options(Map("header" -> "true"))
  .schema(StructType(Seq(StructField("myField", TimestampType, false))))
  .csv(...)
  .as[MyClass]
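For reference, a self-contained version of the above (a minimal sketch assuming local execution; the object name, app name, and master setting are placeholders):

import java.sql.Timestamp

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

case class MyClass(myField: Timestamp)

object TimestampParsing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("timestamp-parsing") // placeholder name
      .master("local[*]")           // placeholder; use your cluster settings
      .getOrCreate()
    import spark.implicits._        // provides the Encoder for MyClass

    val ds = spark.read
      .option("header", "true")
      .schema(StructType(Seq(StructField("myField", TimestampType, false))))
      .csv("pathToFile.csv")        // path taken from the question
      .as[MyClass]

    ds.printSchema() // verify that myField is a timestamp
    ds.show()        // should display parsed timestamps once the types line up

    spark.stop()
  }
}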

It is possible to provide an alternative date format using the dateFormat option with a SimpleDateFormat string.
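For instance, if the file stored timestamps as 01/01/2016 00:00:03, the read could look like this (a sketch; the pattern and sample value are assumptions for illustration):

spark.read
  .option("header", "true")
  .option("dateFormat", "MM/dd/yyyy HH:mm:ss") // SimpleDateFormat pattern matching the assumed input
  .schema(StructType(Seq(StructField("myField", TimestampType, false))))
  .csv(...)
  .as[MyClass]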

The schema definition can be replaced with type casting before .as[MyClass]:

import spark.implicits._ // required for the $"..." column syntax

spark.read
  .options(Map("header" -> "true"))
  .csv(...)
  .withColumn("myField", $"myField".cast("timestamp"))
  .as[MyClass]

For DateType, use java.sql.Date.
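A minimal sketch of the DateType variant (the class name is hypothetical):

import java.sql.Date
import org.apache.spark.sql.types.{DateType, StructField, StructType}

case class MyDateClass(myField: Date) // hypothetical name

spark.read
  .option("header", "true")
  .schema(StructType(Seq(StructField("myField", DateType, false))))
  .csv(...) // rows like 2016-01-01
  .as[MyDateClass]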

zero323