
I am parsing a CSV file whose rows look like this:

2016-10-03, 18.00.00, 2, 6

When reading the file, I create the schema as follows:

StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("Date", DataTypes.DateType, false),
                DataTypes.createStructField("Time", DataTypes.TimestampType, false),
                DataTypes.createStructField("CO(GT)", DataTypes.IntegerType, false),
                DataTypes.createStructField("PT08.S1(CO)", DataTypes.IntegerType, false)));

Dataset<Row> df = spark.read().format("csv").schema(schema).load("src/main/resources/AirQualityUCI/sample.csv");

This produces the following error:

Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException
    at java.sql.Date.valueOf(Unknown Source)
    at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)

I suspect this is caused by the time format. How can I convert the values into a specific format, or what should I change in the StructType so that they are parsed correctly?

The format I want is hh:mm:ss, since that would make it easy to build a timestamp in Spark SQL by concatenating the columns:

2016-10-03, 18:00:00, 2, 6
Shaido
Utkarsh Saraf

1 Answer


If you read both Date and Time as strings, you can easily merge them and convert the result to a timestamp. You do not need to change "." to ":" in the Time column, because the format can be specified when creating the timestamp. An example solution in Scala:

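import org.apache.spark.sql.functions.{concat, lit, unix_timestamp}
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

val df = Seq(("2016-10-03", "00.00.17"),("2016-10-04", "00.01.17"))
  .toDF("Date", "Time")

// Join the two string columns, then parse them with a pattern that matches the dots.
val df2 = df.withColumn("DateTime", concat($"Date", lit(" "), $"Time"))
  .withColumn("Timestamp", unix_timestamp($"DateTime", "yyyy-MM-dd HH.mm.ss"))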

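This gives the following, where the Timestamp column holds Unix epoch seconds (the exact values depend on the session time zone):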

+----------+--------+-------------------+----------+
|      Date|    Time|           DateTime| Timestamp|
+----------+--------+-------------------+----------+
|2016-10-03|00.00.17|2016-10-03 00.00.17|1475424017|
|2016-10-04|00.01.17|2016-10-04 00.01.17|1475510477|
+----------+--------+-------------------+----------+
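Since the question uses Java, the pattern string above can also be checked locally without Spark. The following is only a sketch in plain Java using java.time, not Spark code: the class and method names are hypothetical, and the UTC+8 offset is an assumption chosen because it happens to reproduce the epoch values in the table above. The pattern letters used here (yyyy, MM, dd, HH, mm, ss) carry the same meaning, and the dots are treated as literal separators.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Spark: checks the dotted-time pattern with java.time.
class DottedTimeParser {
    // Same pattern string as in the Scala snippet; '.' is a literal separator here.
    private static final DateTimeFormatter DOT_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH.mm.ss");

    // Concatenate the two string fields with a space and parse them in one step.
    // trim() guards against the space that follows each comma in the sample CSV row.
    static LocalDateTime parse(String date, String time) {
        return LocalDateTime.parse(date.trim() + " " + time.trim(), DOT_FORMAT);
    }

    // Convert to Unix epoch seconds for a given UTC offset (the offset is an assumption).
    static long toEpochSeconds(String date, String time, ZoneOffset offset) {
        return parse(date, time).toEpochSecond(offset);
    }

    public static void main(String[] args) {
        ZoneOffset assumedOffset = ZoneOffset.ofHours(8); // assumption, see lead-in
        System.out.println(toEpochSeconds("2016-10-03", "00.00.17", assumedOffset)); // 1475424017
        System.out.println(toEpochSeconds("2016-10-04", "00.01.17", assumedOffset)); // 1475510477
    }
}
```

With a different offset the same rows yield different epoch values, which is why the numbers in the table should not be expected to match on every machine.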

Of course, if you want, you can still convert the Time column to use ":" instead of ".". This can be done with regexp_replace:

df.withColumn("Time2", regexp_replace($"Time", "\\.", ":"))

If you do this before converting to a timestamp, you need to change the format string used above to "yyyy-MM-dd HH:mm:ss" so that it matches the new separator.
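The interaction between the replacement and the format string can be illustrated in plain Java. This is a sketch with hypothetical class and method names, using String.replaceAll with the same regular expression as the regexp_replace call and java.time for parsing; it is not Spark code. It shows that once the dots are replaced, the colon-based pattern parses the value while the dot-based pattern no longer does.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not part of Spark: normalizes "18.00.00" to "18:00:00" before parsing.
class ColonTimeNormalizer {
    private static final DateTimeFormatter COLON_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    private static final DateTimeFormatter DOT_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH.mm.ss");

    // Same regex as in the regexp_replace call: an escaped dot, replaced by a colon.
    static String normalizeTime(String time) {
        return time.trim().replaceAll("\\.", ":");
    }

    // After normalization the colon-based pattern is the one that matches.
    static LocalDateTime parseNormalized(String date, String time) {
        return LocalDateTime.parse(date.trim() + " " + normalizeTime(time), COLON_FORMAT);
    }

    // Returns false when the old dot-based pattern is applied to a normalized value.
    static boolean dotFormatStillParses(String date, String time) {
        try {
            LocalDateTime.parse(date.trim() + " " + normalizeTime(time), DOT_FORMAT);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(normalizeTime("18.00.00"));                      // 18:00:00
        System.out.println(dotFormatStillParses("2016-10-03", "18.00.00")); // false
    }
}
```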

Shaido