I have a Hive query that I want to move across to a PySpark script. Part of this query involves converting a date column to the week of the year.
In both cases I do this with the following line in the SELECT part of a SQL statement. In Hive I run the statement directly; from PySpark, I run it using spark.sql(statement):
DATE_FORMAT(from_unixtime(unix_timestamp(dt, 'yyyyMMdd')), 'Y-ww')
where dt holds the date as a string in yyyyMMdd format.
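For context, this is roughly how the statement is run from the PySpark script (the table name events and the single-row test data are just placeholders for illustration; the real table has the dt column described above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# dt is stored as a string in yyyyMMdd format, as in the real table
df = spark.createDataFrame([("20230611",)], ["dt"])
df.createOrReplaceTempView("events")

statement = """
SELECT DATE_FORMAT(from_unixtime(unix_timestamp(dt, 'yyyyMMdd')), 'Y-ww') AS week
FROM events
"""
spark.sql(statement).show()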
I want the first day of the week to be taken as Monday. This works fine in Hive:
hive> SELECT DATE_FORMAT(from_unixtime(unix_timestamp('20230611', 'yyyyMMdd')), 'Y-ww');
2023-23
But in Spark, it takes Sunday as the first day of the week:
spark.sql("SELECT DATE_FORMAT(from_unixtime(unix_timestamp('20230611', 'yyyyMMdd')), 'Y-ww')").show()
2023-24
Is there any way I can get the Spark SQL behaviour to match Hive, with weeks starting on a Monday?
The $LANG environment variable on the machine is set to en_GB.UTF-8.
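If it is relevant, the default locale of the JVM that Spark runs in can be inspected from the same PySpark session like this (via the py4j gateway; I have not confirmed whether Spark's date formatter actually consults it):

# Check the default locale of the JVM backing the Spark session
jvm = spark.sparkContext._jvm
print(jvm.java.util.Locale.getDefault().toString())  # e.g. en_GB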