As the question says, I want to find anomalies in the format of the value in a column in a large dataset.
For example: if I have a date column within a dataset of say 500 million rows, I want to make sure that the date format for all rows in the column is MM-DD-YYYY. I want to find the count and the values where there is an anomaly in this format.
How do I do this? Can I use regex? Can someone give an example? Want to do this using Spark Dataframe.
Asked
Active
Viewed 1,215 times
3

Enkay
- 83
- 1
- 6
1 Answers
3
Proper date format validation using regex can be tricky (See: Regex to validate date format dd/mm/yyyy), but you can use Joda-Time as below:
import scala.util.{Try, Failure}
import org.apache.spark.sql.functions.udf
object FormatChecker extends java.io.Serializable {
val fmt = org.joda.time.format.DateTimeFormat forPattern "MM-dd-yyyy"
def invalidFormat(s: String) = Try(fmt parseDateTime s) match {
case Failure(_) => true
case _ => false
}
}
val df = sc.parallelize(Seq(
"01-02-2015", "99-03-2010", "---", "2015-01-01", "03-30-2001")
).toDF("date")
invalidFormat = udf((s: String) => FormatChecker.invalidFormat(s))
df.where(invalidFormat($"date")).count()