
I need to read a CSV file in Spark with a specific date format, but the date column still ends up interpreted as a plain string instead of a date.

Input csv file:

cat oo2.csv
date,something
2013.01.02,0
2013.03.21,0

With Spark 3.1.1:

import org.apache.spark.sql.SparkSession

val spark:SparkSession = SparkSession.builder().master("local[*]")
    .appName("Hmmm")
    .getOrCreate()

val oo = spark.read.
  option("header","true").
  option("inferSchema","true").
  option("dateFormat","yyyy.MM.dd").
  csv("oo2.csv")

oo.printSchema()
oo.show()

I get:

root
 |-- date: string (nullable = true)
 |-- something: integer (nullable = true)
+----------+---------+
|      date|something|
+----------+---------+
|2013-01-02|        0|
|2013-03-21|        0|
+----------+---------+

Am I missing something? It should be simple; basically the same approach is described in https://stackoverflow.com/a/46299504/1408096, but no joy...

PS: if I try to parse the date outside Spark:

import java.text.SimpleDateFormat

val a = new SimpleDateFormat("yyyy.MM.dd")
a.parse("2013.01.02")

it works perfectly fine.
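Note that Spark 3.x parses dates with the java.time API (DateTimeFormatter-style patterns) rather than SimpleDateFormat, so the closer standalone check is:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Spark 3.x (with spark.sql.legacy.timeParserPolicy at its default)
// resolves dateFormat via DateTimeFormatter, not SimpleDateFormat
val fmt = DateTimeFormatter.ofPattern("yyyy.MM.dd")
LocalDate.parse("2013.01.02", fmt)  // → 2013-01-02
```

which also parses fine, so the pattern itself is not the problem.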


1 Answer


Spark's CSV schema inference cannot infer the date type. There are two possibilities:

  1. Schema needs to be specified:
val df = spark.read
              .option("header",true)
              .option("dateFormat","yyyy.MM.dd")
              .schema("date date, something int")
              .csv("oo2.csv")
  2. A workaround like:
import org.apache.spark.sql.functions.{col, to_date}

val oo = spark.read.
  option("header","true").
  //infer schema for the other columns
  option("inferSchema","true").
  csv("oo2.csv").
  //manually create a new date column from the string column
  withColumn("new_date", to_date(col("date"),"yyyy.MM.dd"))
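With the workaround you can then drop the original string column so only the properly typed one remains (a sketch against the same oo2.csv; `new_date` and `fixed` are illustrative names):

```scala
import org.apache.spark.sql.functions.{col, to_date}

val fixed = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("oo2.csv")
  .withColumn("new_date", to_date(col("date"), "yyyy.MM.dd"))
  .drop("date")                          // discard the string column
  .withColumnRenamed("new_date", "date") // keep the original column name

fixed.printSchema()
// root
//  |-- something: integer (nullable = true)
//  |-- date: date (nullable = true)
```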

We have raised a new feature request for the ability to infer the date type during reading. Let's see how the dev community responds.

  • nope, the schema doesn't need to be specified (there is `option("inferSchema","true")` to get the schema automatically). This works fine with the exception of the `date` column – xhudik Apr 03 '21 at 20:02
  • That's exactly the problem. `inferSchema` cannot infer date type columns. – mck Apr 03 '21 at 20:02
  • hmmm - if I look at: https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L719-L723 - it seems `dateFormat` can be used. If the schema can be inferred for `integer` or `double`, why not for `date`? That would be strange... Why do you think `inferSchema` cannot work with `date`? – xhudik Apr 03 '21 at 20:12
  • Yes, dateFormat can be used as long as you specify a schema. That's just from my experience, and your question also confirmed that inferSchema doesn't work with dateFormat. – mck Apr 03 '21 at 20:17
  • The link you cited did not say that dateFormat can be used with inferSchema. – mck Apr 03 '21 at 20:18
  • thanks @mck - I hope you are not right :) - that would be weird if `InferSchema` works with `Integer` but not with `Date` – xhudik Apr 03 '21 at 20:35
  • @xhudik see [this post](https://stackoverflow.com/a/46595057) for how inferSchema works... – mck Apr 04 '21 at 07:02
  • indeed, @mck, you are right, `timestamp` can be inferred but not `date`. Strange :(. Thanks for this finding! – xhudik Apr 04 '21 at 10:53