2

I have dataframe, which is having few column which is having date data. I want to apply validation on column and if wrong date comes i want to update that dataframe with error message column. I have tried but not working correctly. My sample dataframe data.

+-------+-----+-----------+-------------+
|AirName|Place|TakeoffDate|arriveoffDate|
+-------+-----+-----------+-------------+
|  Delta|  Aus|   11/16/18|     08/06/19|
|  Delta|  Pak|   11/16/18|     08/06/19|
| Vistra|  New|   11/16/18|     15/06/19|
|  Delta|  Aus|   15/16/18|     08/06/19|
| JetAir|  Aus|   11/16/18|         null|
+-------+-----+-----------+-------------+

I have tried below code.

val DATE_TIME_FORMAT = "MM-dd-yy"

  def validateDf(row: Row): Boolean = try {
    //assume row.getString(1) with give Datetime string
    java.time.LocalDateTime.parse(row.getString(2), java.time.format.DateTimeFormatter.ofPattern(DATE_TIME_FORMAT))
    true
  } catch {
    case ex: java.time.format.DateTimeParseException => {
      // Handle exception if you want
      false
    }
  }

val validDf = sample1.filter(validateDf(_))
val inValidDf = sample1.except(validDf)

expected dataframe

+-------+-----+-----------+-------------+-------------+
|AirName|Place|TakeoffDate|arriveoffDate|error message|
+-------+-----+-----------+-------------+-------------+
|  Delta|  Aus|   11/16/18|     08/06/19|             |
|  Delta|  Pak|   11/16/18|     08/06/19|             |
| Vistra|  New|   11/16/18|     15/06/19|wrong date   |
|  Delta|  Aus|   15/16/18|     08/06/19|wrong date   |
| JetAir|  Aus|   11/16/18|         null|             |
+-------+-----+-----------+-------------+-------------+
Lasf
  • 2,536
  • 1
  • 16
  • 35
Praveen
  • 55
  • 7
  • Try to use SimpleDateFormat(DATE_TIME_FORMAT).parse(row.getString(2)) – gekomad Jun 04 '19 at 14:18
  • What is your invalid result? – Krzysztof Atłasik Jun 04 '19 at 14:54
  • @KrzysztofAtłasik java.time.LocalDate.parse(row.getString(2), java.time.format.DateTimeFormatter.ofPattern(DATE_TIME_FORMAT)) . Here my data present in excel format and excel column by default type if Date type but here i am trying to parsing getString string.Do you know how to write same thing with row.getDate option. – Praveen Jun 06 '19 at 08:20

1 Answers1

2

I suggest using a User-Defined Function (UDF).

Here's an example:

Test Dataframe

val someDF = Seq(
  ("11/16/18", "Aus"),
  ("15/16/18", "Pak"),
  ("11/16/18", "New")
).toDF("TakeoffDate", "Place")

UDF

import org.apache.spark.sql.functions.udf

def isValidDate = udf((A: String) => {

  val DATE_TIME_FORMAT = "MM/dd/yy"

  try{
    java.time.LocalDate.parse(A, java.time.format.DateTimeFormatter.ofPattern(DATE_TIME_FORMAT))
    true
  } catch {
    case ex: java.time.format.DateTimeParseException => 
      false         
  }
})

Please note I am using LocalDate instead of LocalDateTime.

Usage:

someDF.withColumn("IsValidDate", isValidDate(someDF("TakeoffDate"))).show()

Result:

+-----------+-----+-----------+
|TakeoffDate|Place|IsValidDate|
+-----------+-----+-----------+
|   11/16/18|  Aus|       true|
|   15/16/18|  Pak|      false|
|   11/16/18|  New|       true|
+-----------+-----+-----------+

Hope it helps.

Regards.

  • Thanks for reply but i missed one thing to tell you, i am reading data from excel file. you know if any column having date value it by default have date format not a string format. I was able to handle it with string type. Hope you understood my point. – Praveen Jun 06 '19 at 08:18
  • Not sure I understood it correctly but maybe if you define a schema when you read the excel file? Maybe [this answer](https://stackoverflow.com/questions/39926411/provide-schema-while-reading-csv-file-as-a-dataframe) will help you. – Carlos David Peña Jun 06 '19 at 14:12