
I have a Spark DataFrame imported from a CSV file. After applying some manipulations (mainly deleting columns/rows), I try to save the new DataFrame to Hadoop, which brings up an error message:

ValueError: year out of range

I suspect that some columns of type DateType or TimestampType are corrupted. At least in one column I found an entry with a year '207' - this seems to create issues.

**How can I check whether the DataFrame adheres to the required time ranges?**

**I thought about writing a function that takes the DataFrame and, for each DateType/TimestampType column, computes the minimum and maximum value, but I cannot get this to work.**

Any ideas?

PS: In my understanding, Spark always checks and enforces the schema. Would this not include a check for minimum/maximum values?

RaspyVotan

1 Answer


For validating the dates, regular expressions can help.

For example, to validate a date with the format MM-dd-yyyy:

Step 1: build a regular expression for your date format. For MM-dd-yyyy it is ^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$ (note the month group comes before the day group).

This step helps you find invalid dates that won't parse and would otherwise cause errors.
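As a quick illustration, here is that pattern in plain Python (the month group first, matching MM-dd-yyyy; the separator may be -, space, /, or .):

```python
import re

# Pattern for MM-dd-yyyy: month 01-12, day 01-31, year 19xx or 20xx
DATE_RE = re.compile(r"^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$")

def looks_like_date(s):
    """True if s matches the MM-dd-yyyy shape (no calendar check, e.g. 02-31 passes)."""
    return DATE_RE.match(s) is not None
```

Note that a regex only checks the shape of the string; impossible calendar dates like 02-31-2015 still pass and need an actual parse (step 2 below).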

Step 2: convert the strings to dates. The following Scala code can help:

import scala.util.{Try, Failure}
import org.apache.spark.sql.functions.udf
import spark.implicits._  // for toDF and $"..." (already in scope in spark-shell)

object FormatChecker extends java.io.Serializable {
  // Joda-Time formatter for the expected pattern
  val fmt = org.joda.time.format.DateTimeFormat.forPattern("MM-dd-yyyy")
  // true if the string cannot be parsed with the pattern
  def invalidFormat(s: String): Boolean = Try(fmt.parseDateTime(s)) match {
    case Failure(_) => true
    case _ => false
  }
}

val df = sc.parallelize(Seq(
    "01-02-2015", "99-03-2010", "---", "2015-01-01", "03-30-2001")
).toDF("date")

val invalidFormat = udf((s: String) => FormatChecker.invalidFormat(s))
df.where(invalidFormat($"date")).count()
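For the PySpark version asked about in the comments, the same parse check can be written as a plain Python function (the name invalid_format is mine) and then wrapped in a UDF:

```python
from datetime import datetime

def invalid_format(s, fmt="%m-%d-%Y"):
    """True if s does not parse with the given format (mirrors FormatChecker above)."""
    try:
        datetime.strptime(s, fmt)
        return False
    except (ValueError, TypeError):
        return True
```

To use it on a DataFrame, something along the lines of `df.where(udf(invalid_format, "boolean")("date")).count()` with `from pyspark.sql.functions import udf` should work; `java.io.Serializable` has no Python counterpart and is simply not needed here.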
bob
  • Hi, thanks a lot for your effort - I get your main idea. Unfortunately my Scala knowledge is very limited - how would that work in Python/PySpark? I think java.io.Serializable is not available, is it? – RaspyVotan Aug 27 '18 at 14:09
  • How do we do this in PySpark? – Manas Jani Apr 03 '19 at 19:38