4

I have a column 'start_date' that contains an integer, 37823. This happened when I used the xlrd library to convert xlsx to csv, so the date '2003/07/21' got converted to 37823.

I have gone through the xlrd documentation and I understand there are several ways to convert it to a date at that stage. However, I need to do the conversion to a date format using PySpark in AWS Glue ETL jobs. Any suggestions?

I tried using the to_date and date_format functions, but nothing worked.

Raj
  • Possible duplicate of [How do I read a date in Excel format in Python?](https://stackoverflow.com/questions/1108428/how-do-i-read-a-date-in-excel-format-in-python) – Dylan Smith Aug 13 '18 at 21:20
  • Dylan - I guess it's not. I am not looking to convert it to a date format during the xlsx-to-csv conversion; I need to convert it later in the data pipeline. Is that possible? – Raj Aug 13 '18 at 21:26
  • Isn't this going to work for you -> https://stackoverflow.com/questions/2623156/how-to-convert-the-integer-date-format-into-yyyymmdd ? – Dawid_Sielski Aug 13 '18 at 21:30
  • Possible duplicate of [How to convert the integer date format into YYYYMMDD?](https://stackoverflow.com/questions/2623156/how-to-convert-the-integer-date-format-into-yyyymmdd) – udo Aug 13 '18 at 21:33
  • @Dawid_Sielski - I used the same function like this: df = df.withColumn('hire date', datetime.fromordinal(df['hire date']).strftime('%Y%m%d')) and got TypeError: an integer is required – Raj Aug 13 '18 at 22:56
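The answers below treat the integer as an Excel serial date counted from an epoch of 1899-12-30; a quick plain-Python check (independent of Spark) confirms that 37823 maps to 2003/07/21. The TypeError in the last comment comes from passing a Spark Column where datetime.fromordinal expects an integer, which is why the first answer below wraps the conversion in a UDF:

import datetime

excel_serial = 37823                        # value seen in the 'start_date' column
excel_epoch = datetime.date(1899, 12, 30)   # Excel's day-zero
print(excel_epoch + datetime.timedelta(days=excel_serial))  # 2003-07-21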

3 Answers

6

I was finally able to resolve this issue.

Using UDF:

import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def convert_date(x):
    # Excel serial dates count days from the 1899-12-30 epoch
    mDt = datetime.datetime(1899, 12, 30)
    dlt = mDt + datetime.timedelta(days=x)
    return dlt.strftime("%Y-%m-%d")

convert_date_udf = udf(convert_date, StringType())
df = df.withColumn('hire date', convert_date_udf('hire date'))

Without using UDF:

from pyspark.sql import functions as F
df = df.withColumn('hire date', F.expr("date_add(to_date('1899-12-30'), cast(`hire date` as int))").cast(StringType()))
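A quick way to sanity-check the non-UDF version on a throwaway DataFrame (this assumes an active SparkSession named spark and uses the same 'hire date' column name as above; the sample row is hypothetical):

from pyspark.sql import functions as F

test_df = spark.createDataFrame([(37823,)], ['hire date'])  # hypothetical sample row
test_df = test_df.withColumn('hire date', F.expr("date_add(to_date('1899-12-30'), cast(`hire date` as int))").cast('string'))
test_df.show()  # prints 2003-07-21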

Hope it helps!

Raj
1

I suppose there are more elegant ways to do this, but this is what I have come up with for now.

from datetime import date

excel_epoch = date(1899, 12, 30).toordinal()  # Excel serial dates count from 1899-12-30
df.hire_date = df.hire_date.apply(lambda x: date.fromordinal(excel_epoch + x))  # this will give you the date in dash format
df.hire_date = df.hire_date.apply(lambda x: str(x).replace('-', '/'))  # this will simply replace dash with slash
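A self-contained sketch of this approach, assuming df is a pandas DataFrame (Spark DataFrames do not have a column-level apply, so you would first need toPandas() or one of the Spark-based answers); the sample value is hypothetical:

import pandas as pd
from datetime import date

excel_epoch = date(1899, 12, 30).toordinal()  # Excel's day-zero as a Python ordinal
df = pd.DataFrame({'hire_date': [37823]})     # hypothetical sample data
df.hire_date = df.hire_date.apply(lambda x: str(date.fromordinal(excel_epoch + x)).replace('-', '/'))
print(df.hire_date[0])  # 2003/07/21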

Hope that this works for you :)

Dawid_Sielski
0

Looks like 12/30/1899 + 37823 days = 2003/07/21

You can use the date_add function below instead of a UDF or plain Python functions. UDFs tend to be a little slower than built-in PySpark functions.

from pyspark.sql.functions import date_add

weird_date = 37823
df = spark.createDataFrame([('1899-12-30',)], ['dt'])
df.select(date_add(df.dt, weird_date).alias('converted_date')).collect()
# [Row(converted_date=datetime.date(2003, 7, 21))]
Lucius Kaye
  • Thanks Lucius. I tried something like this: df2 = spark.createDataFrame([('1899-12-30',)], ['dt']); df = df.select(date_add(df2.dt, df['hire date']).alias('converted_date')).collect(); print df.head(), and got this error: TypeError: 'Column' object is not callable – Raj Aug 15 '18 at 15:26
  • I think it's because you used "df" before you declared it. – Lucius Kaye Aug 15 '18 at 17:15
  • Well, I keep getting the error: Method date_add([class org.apache.spark.sql.Column, class java.lang.String]) does not exist. This is my code: from pyspark.sql import functions as f; df2 = spark.createDataFrame([('1899-12-30',)], ['dt']); df = df.select(f.date_add(df2.dt, 'hire date').alias('hire date')) – Raj Aug 15 '18 at 18:45
  • The date_add function takes (column, integer). You're trying to put (column, string). – Lucius Kaye Aug 15 '18 at 20:11
  • The 'hire date' column is of IntegerType. However, since the name has a space, I am not able to reference it without quotes, and when I use single quotes it is interpreted as a string literal. – Raj Aug 15 '18 at 20:49
  • For columns with a space in the name, I had to use backticks (`) instead of single quotes ('). That worked for me; a sketch of the final expression is below. – Raj Aug 16 '18 at 21:09
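Pulling the comment thread together, a minimal sketch of applying this answer to the asker's real column: on the Spark version used here, the Python date_add helper only accepts a plain integer for the day count (hence the "does not exist" error above), so the SQL expr form with backticks around the space-containing column name is one way to pass a column instead. This assumes df already contains the integer 'hire date' column:

from pyspark.sql import functions as F

# backticks let the SQL expression reference a column name that contains a space
df = df.withColumn('hire date', F.expr("date_add(to_date('1899-12-30'), cast(`hire date` as int))"))
df.select('hire date').show(1)  # 2003-07-21 for an input of 37823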