
I am parsing a CSV file with data such as:

03-10-2016,18:00:00,2,6

When reading the file, I create the schema as below:

StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("Date", DataTypes.DateType, false),
                DataTypes.createStructField("Time", DataTypes.TimestampType, false),
                DataTypes.createStructField("CO(GT)", DataTypes.IntegerType, false),
                DataTypes.createStructField("PT08.S1(CO)", DataTypes.IntegerType, false)));

Dataset<Row> df = spark.read().format("csv")
        .option("Date", "dd-MM-yyyy")
        .schema(schema)
        .load("src/main/resources/AirQualityUCI/sample.csv");

It produces the error below:

Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException
    at java.sql.Date.valueOf(Unknown Source)
    at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)

I suspect this is due to a date format mismatch. What are the ways of converting the dates into a specific format?

Utkarsh Saraf

1 Answer

Use the dateFormat option when reading the CSV file(s), as follows:

val csvs = spark.
  read.
  format("csv").
  option("dateFormat", "dd-MM-yyyy"). // <-- should match 03-10-2016
  load(...)

The default for dateFormat is yyyy-MM-dd so it's no surprise you've got the parsing error.


Quoting from the javadoc of valueOf:

Throws IllegalArgumentException - if the date given is not in the JDBC date escape format (yyyy-[m]m-[d]d)

That means the value's format is not what the parser behind valueOf expects.
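To see this in isolation (a minimal JDK-only sketch, independent of Spark; the class name here is illustrative): java.sql.Date.valueOf accepts the JDBC escape format but rejects the dd-MM-yyyy value from the CSV, which is exactly the IllegalArgumentException in the stack trace above.

```java
import java.sql.Date;

public class ValueOfDemo {
    // Returns true if java.sql.Date.valueOf can parse the given string.
    static boolean parses(String s) {
        try {
            Date.valueOf(s);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // JDBC escape format yyyy-[m]m-[d]d works:
        System.out.println(parses("2016-10-03")); // true
        // The CSV's dd-MM-yyyy value does not:
        System.out.println(parses("03-10-2016")); // false
    }
}
```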

I have two recommendations here:

  1. Read the dataset and show it to see what you have inside.

  2. Use the dateFormat option to define the proper format (it's yyyy-MM-dd by default).

Find more about the format patterns in Date and Time Patterns (of java.text.SimpleDateFormat).
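As a sketch of how those pattern letters apply to the sample row (plain java.text.SimpleDateFormat, outside of Spark; the class and method names below are illustrative): dd-MM-yyyy parses 03-10-2016 as day 3, month 10, year 2016, which can then be reformatted into the JDBC escape format that valueOf expects.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class PatternDemo {
    // Parses the CSV's date string and reformats it into the JDBC escape format.
    static String toIso(String raw) {
        try {
            SimpleDateFormat csv = new SimpleDateFormat("dd-MM-yyyy"); // dd = day, MM = month, yyyy = year
            SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd");
            Date parsed = csv.parse(raw);
            return iso.format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toIso("03-10-2016")); // 2016-10-03
    }
}
```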

Jacek Laskowski