
As the question says, I want to find anomalies in the format of the value in a column in a large dataset.

For example: if I have a date column within a dataset of say 500 million rows, I want to make sure that the date format for all rows in the column is MM-DD-YYYY. I want to find the count and the values where there is an anomaly in this format.

How do I do this? Can I use regex? Can someone give an example? Want to do this using Spark Dataframe.

1 Answer


Proper date format validation with regex can be tricky (see: Regex to validate date format dd/mm/yyyy), but you can use Joda-Time instead:

import scala.util.{Try, Failure}
import org.apache.spark.sql.functions.udf

// Wrap the formatter in a serializable object so Spark can ship it to the executors
object FormatChecker extends java.io.Serializable {
  val fmt = org.joda.time.format.DateTimeFormat forPattern "MM-dd-yyyy"
  // Returns true when the string cannot be parsed as MM-dd-yyyy
  def invalidFormat(s: String) = Try(fmt parseDateTime s) match {
    case Failure(_) => true
    case _ => false
  }
}

// Sample data with both valid and malformed dates
val df = sc.parallelize(Seq(
    "01-02-2015", "99-03-2010", "---", "2015-01-01", "03-30-2001")
).toDF("date")

// Register the check as a UDF and count the rows that fail validation
val invalidFormat = udf((s: String) => FormatChecker.invalidFormat(s))
df.where(invalidFormat($"date")).count()