
How should I properly perform datetime parsing with the Spark 2.0 Dataset API?

There are lots of samples for the DataFrame / RDD APIs.

A case class like

case class MyClass(myField: java.sql.Datetime)

combined with a read like

val mynewDf = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("charset", "UTF-8")
    .option("delimiter", ",")
    .csv("pathToFile.csv")
    .as[MyClass]

is not enough to cast the type. How should I perform this properly using the Dataset API?

Edit:

Loading the data works, e.g. printSchema shows myDateField: timestamp (nullable = true). But myDf.show results in a

java.lang.IllegalArgumentException
        at java.sql.Date.valueOf(Date.java:143)

which led me to believe that my parsing of the dates was incorrect. How can this be?

Georg Heiler

1 Answer


The correct representation of a timestamp is java.sql.Timestamp, so the class should be defined as

case class MyClass(myField: java.sql.Timestamp)

with the corresponding data:

myField
"2016-01-01 00:00:03"

If these conditions are satisfied, all you have to do is provide the schema:

import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

spark.read
  .options(Map("header" -> "true"))
  .schema(StructType(Seq(StructField("myField", TimestampType, false))))
  .csv(...)
  .as[MyClass]
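For reference, a self-contained version of the above (a minimal sketch assuming local execution; the object name, app name, and master setting are placeholders):

import java.sql.Timestamp

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

case class MyClass(myField: Timestamp)

object TimestampParsing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("timestamp-parsing") // placeholder name
      .master("local[*]")           // placeholder; use your cluster settings
      .getOrCreate()
    import spark.implicits._        // provides the Encoder for MyClass

    val ds = spark.read
      .option("header", "true")
      .schema(StructType(Seq(StructField("myField", TimestampType, false))))
      .csv("pathToFile.csv")        // path taken from the question
      .as[MyClass]

    ds.printSchema() // verify that myField is a timestamp
    ds.show()        // should display parsed timestamps once the types line up

    spark.stop()
  }
}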

It is possible to provide an alternative date format using the dateFormat option with a SimpleDateFormat string.
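For instance, if the file stored timestamps as 01/01/2016 00:00:03, the read could look like this (a sketch; the pattern and sample value are assumptions for illustration):

spark.read
  .option("header", "true")
  .option("dateFormat", "MM/dd/yyyy HH:mm:ss") // SimpleDateFormat pattern matching the assumed input
  .schema(StructType(Seq(StructField("myField", TimestampType, false))))
  .csv(...)
  .as[MyClass]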

The schema definition can be replaced with type casting before .as[MyClass]:

import spark.implicits._ // required for the $"..." column syntax

spark.read
  .options(Map("header" -> "true"))
  .csv(...)
  .withColumn("myField", $"myField".cast("timestamp"))
  .as[MyClass]

For DateType, use java.sql.Date.
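A minimal sketch of the DateType variant (the class name is hypothetical):

import java.sql.Date
import org.apache.spark.sql.types.{DateType, StructField, StructType}

case class MyDateClass(myField: Date) // hypothetical name

spark.read
  .option("header", "true")
  .schema(StructType(Seq(StructField("myField", DateType, false))))
  .csv(...) // rows like 2016-01-01
  .as[MyDateClass]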

zero323