I have a Spark DataFrame imported from a CSV file. After applying some manipulations (mainly dropping columns/rows), I try to save the new DataFrame to Hadoop, which raises the following error:
`ValueError: year out of range`
I suspect that some columns of type DateType or TimestampType are corrupted. In at least one column I found an entry with the year '207', which seems to be the source of the problem.
**How can I check whether the DataFrame's date/timestamp values stay within the required ranges?**

I thought about writing a function that takes the DataFrame and returns the minimum and maximum value of every DateType/TimestampType column, but I cannot get it to work. Roughly, this is what I had in mind (see the sketch below).
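Just a sketch of the idea; `date_ranges` is a placeholder name, and I'm assuming the relevant columns can be identified from `df.schema`:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DateType, TimestampType

def date_ranges(df):
    """Return {column: (min, max)} for every DateType/TimestampType column of df."""
    # Pick out the date/timestamp columns from the schema.
    date_cols = [field.name for field in df.schema.fields
                 if isinstance(field.dataType, (DateType, TimestampType))]
    if not date_cols:
        return {}
    # One aggregation with min/max per column, so the data is only scanned once.
    aggs = [agg for c in date_cols
            for agg in (F.min(c).alias(f"min_{c}"), F.max(c).alias(f"max_{c}"))]
    row = df.agg(*aggs).collect()[0]
    return {c: (row[f"min_{c}"], row[f"max_{c}"]) for c in date_cols}

# Intended usage: inspect the ranges before writing the DataFrame out.
# for col, (lo, hi) in date_ranges(df).items():
#     print(col, lo, hi)
```

The idea would be to run this on the cleaned DataFrame before saving it and flag any column whose minimum or maximum falls outside the supported range.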
Any ideas?
PS: My understanding was that Spark always checks and enforces the schema. Wouldn't that include a check for minimum/maximum values?