For some strange reason, the dayofmonth function in Spark seems to return wrong values for dates in the year 1500 or earlier.

Following are the results that were obtained:

scala> spark.sql("SELECT dayofmonth('1501-02-14') ").show()
+------------------------------------+
|dayofmonth(CAST(1501-02-14 AS DATE))|
+------------------------------------+
|                                  14|
+------------------------------------+


scala> spark.sql("SELECT dayofmonth('1500-02-14') ").show()
+------------------------------------+
|dayofmonth(CAST(1500-02-14 AS DATE))|
+------------------------------------+
|                                  13|
+------------------------------------+


scala> spark.sql("SELECT dayofmonth('1400-02-14') ").show()
+------------------------------------+
|dayofmonth(CAST(1400-02-14 AS DATE))|
+------------------------------------+
|                                  12|
+------------------------------------+

Can anyone explain why Spark behaves this way?

1 Answer

That's because dates are exposed externally as java.sql.Date and are represented internally as the number of days since the Unix epoch (1970-01-01).
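
You can reproduce the mismatch outside Spark. Here is a minimal sketch (not Spark's exact code path; the JVM is pinned to UTC so the millisecond arithmetic is unambiguous): the legacy java.sql.Date stack uses the hybrid Julian/Gregorian calendar, while java.time uses the proleptic Gregorian calendar, so the same day count since the epoch maps to different calendar dates before 1582.

import java.sql.Date
import java.time.LocalDate
import java.util.TimeZone

// Pin the default timezone so getTime maps cleanly to whole days.
TimeZone.setDefault(TimeZone.getTimeZone("UTC"))

// Parsed with the legacy hybrid calendar (Julian before 1582).
val legacy = Date.valueOf("1500-02-14")

// Days since 1970-01-01; floorDiv because the count is negative here.
val epochDays = Math.floorDiv(legacy.getTime, 86400000L)

// java.time reinterprets that same day count in the proleptic
// Gregorian calendar and lands on a different day of the month.
println(LocalDate.ofEpochDay(epochDays))  // 1500-02-23, a 9-day shift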

References: source 1, source 2 and source 3.

This mainly creates lots of issues when dealing with dates before 1970, but you can try creating UDFs (I can't believe I'm writing this) with external libraries that might be able to cope with this problem, as advised here.
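
As a sketch of that workaround (the UDF name dayofmonth_proleptic is made up for this example), you can parse the string with java.time.LocalDate, which uses the proleptic Gregorian calendar, so the value never goes through java.sql.Date at all:

import java.time.LocalDate

// Reads the day of month straight from the parsed string, bypassing
// the java.sql.Date / hybrid-calendar code path entirely.
spark.udf.register("dayofmonth_proleptic", (s: String) => LocalDate.parse(s).getDayOfMonth)

spark.sql("SELECT dayofmonth_proleptic('1400-02-14')").show()
// shows 14, as expected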

Reminder: of course, you need to take the performance bottlenecks of using UDFs into account. More on that here.

For more information about Unix Time, you can read the following:

https://en.wikipedia.org/wiki/Unix_time
