
I have a Hive query that I want to move across to a PySpark script; part of this query involves converting a date column to the week of the year.

In both cases I do this with the following line in the SELECT part of an SQL statement. In Hive I run the statement directly; from PySpark, I run it using spark.sql(statement)

DATE_FORMAT(from_unixtime(unix_timestamp(dt, 'yyyyMMdd')), 'Y-ww') 

Where dt contains the datetime in the yyyyMMdd format.

I want the first day of the week to be taken as Monday. This works fine in Hive:

hive> SELECT DATE_FORMAT(from_unixtime(unix_timestamp('20230611', 'yyyyMMdd')), 'Y-ww');
2023-23

But in Spark, it takes Sunday as the first day of the week:

spark.sql("SELECT DATE_FORMAT(from_unixtime(unix_timestamp('20230611', 'yyyyMMdd')), 'Y-ww')").show()
2023-24

Is there any way I can get the Spark SQL behaviour to be the same as Hive, with weeks starting on a Monday?

The $LANG environment variable on the machine is set to en_GB.UTF-8

James
    I don't know if DATE_FORMAT supports this. I believe the `weekofyear` function does use ISO 8601 weeks, even though the rest of Spark uses Sunday-based ones. There are some tips on this page as well: https://mungingdata.com/apache-spark/week-end-start-dayofweek-next-day – Mark Reed Jun 22 '23 at 14:55
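
A quick check of that claim (a minimal sketch, assuming an active SparkSession named spark):

# 2023-06-11 is a Sunday; weekofyear reports ISO week 23 (Monday-start weeks),
# matching the Hive result above.
spark.sql("SELECT weekofyear(to_date('20230611', 'yyyyMMdd')) AS wk").show()
# +--+
# |wk|
# +--+
# |23|
# +--+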

2 Answers


Try the weekofyear function with spark.sql.legacy.timeParserPolicy set to LEGACY.

Example:

spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
spark.sql("""SELECT concat_ws("-",date_format(to_date('20230611', 'yyyyMMdd'),'y'),weekofyear(from_unixtime(unix_timestamp('20230611', 'yyyyMMdd')))) as res""").show(10,False)
#+-------+
#|res    |
#+-------+
#|2023-23|
#+-------+
notNull
  This is close, but it will get the year wrong at the start of (most) years. For example spark.sql("SELECT concat_ws('-',date_format(to_date('20230101','yyyyMMdd'),'y'),weekofyear(from_unixtime(unix_timestamp('20230101', 'yyyyMMdd'))))").show() returns 2023-52 – James Jun 23 '23 at 12:07

I extended @notNull's answer by combining it with this answer:

https://stackoverflow.com/a/70510233/2979576

to get the ISO 8601 year and week, which then makes Spark return the same result as Hive. This includes the special cases at the beginning and end of the year, where the calendar year may not be the same as the year that the week belongs to:

SELECT concat_ws(
    "-",
    date_format(date_add(to_date('20220105', 'yyyyMMdd'), 4 - ((dayofweek(to_date('20220105', 'yyyyMMdd')) + 5) % 7 + 1)), 'y'),
    lpad(weekofyear(to_date('20220105', 'yyyyMMdd')), 2, '0')
) AS year_week
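
For reference, the same logic applied to a DataFrame column rather than a literal might look like this in PySpark. This is just a sketch, assuming Spark 3.3+ (where date_add accepts a column expression for the day count); the helper name iso_year_week is my own:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def iso_year_week(dt_col):
    # Parse the yyyyMMdd string, then shift to the Thursday of the same
    # ISO week: the ISO year is the calendar year of that Thursday.
    d = F.to_date(dt_col, "yyyyMMdd")
    iso_dow = (F.dayofweek(d) + 5) % 7 + 1   # Monday=1 ... Sunday=7
    thursday = F.date_add(d, 4 - iso_dow)
    return F.concat_ws(
        "-",
        F.date_format(thursday, "y"),
        F.lpad(F.weekofyear(d), 2, "0"),
    )

df = spark.createDataFrame([("20220105",), ("20230101",), ("20230611",)], ["dt"])
df.select("dt", iso_year_week(F.col("dt")).alias("year_week")).show()
# expected: 2022-01, 2022-52 and 2023-23 respectively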

Be aware that your locale settings may affect the result, and newer versions of Spark might have a cleaner solution.

James