
I have a Spark DataFrame. One of the columns has dates populated in a format like 2018-Jan-12.

I need to change this to the format 20180112.

How can this be achieved?

Heether
  • One way is to use a `udf` like in [the answers to this question](https://stackoverflow.com/questions/41392303/converting-yyyymmdd-to-mm-dd-yyyy-format-in-pyspark). But the preferred way is probably to first [convert your string to a date](https://stackoverflow.com/questions/38080748/convert-pyspark-string-to-date-format) and then [convert the date back to a string in the desired format](https://stackoverflow.com/questions/32977360/pyspark-changing-type-of-column-from-date-to-string). – pault May 30 '18 at 14:25

3 Answers


For Spark version 1.5+

Suppose you had the following DataFrame:

df = sqlCtx.createDataFrame([("2018-Jan-12",)], ["date_str"])
df.show()
#+-----------+
#|   date_str|
#+-----------+
#|2018-Jan-12|
#+-----------+

To avoid using udfs, you can first convert the string to a date:

from pyspark.sql.functions import from_unixtime, unix_timestamp
df = df.withColumn('date', from_unixtime(unix_timestamp('date_str', 'yyyy-MMM-dd')))
df.show()
#+-----------+-------------------+
#|   date_str|               date|
#+-----------+-------------------+
#|2018-Jan-12|2018-01-12 00:00:00|
#+-----------+-------------------+

Then format the date as a string in your desired format:

from pyspark.sql.functions import date_format, col
df = df.withColumn("new_date_str", date_format(col("date"), "yyyyMMdd"))
df.show()
#+-----------+-------------------+------------+
#|   date_str|               date|new_date_str|
#+-----------+-------------------+------------+
#|2018-Jan-12|2018-01-12 00:00:00|    20180112|
#+-----------+-------------------+------------+

Or if you prefer, you can chain it all together and skip the intermediate steps:

import pyspark.sql.functions as f
df.select(
    f.date_format(
        f.from_unixtime(
            f.unix_timestamp(
                'date_str',
                'yyyy-MMM-dd')
        ),
        "yyyyMMdd"
    ).alias("new_date_str")
).show()
#+------------+
#|new_date_str|
#+------------+
#|    20180112|
#+------------+
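
As an aside, on Spark 2.2+ the same conversion can be written without the `unix_timestamp` round trip, since `to_date` accepts a parse pattern directly. A minimal sketch of that alternative:

from pyspark.sql.functions import date_format, to_date

# Spark 2.2+: to_date takes the parse pattern as a second argument
df.select(
    date_format(to_date("date_str", "yyyy-MMM-dd"), "yyyyMMdd").alias("new_date_str")
).show()
#+------------+
#|new_date_str|
#+------------+
#|    20180112|
#+------------+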
pault

You can use a PySpark UDF.

from pyspark.sql import functions as f
from pyspark.sql import types as t
from datetime import datetime

df = df.withColumn('date_col', f.udf(lambda d: datetime.strptime(d, '%Y-%b-%d').strftime('%Y%m%d'), t.StringType())(f.col('date_col')))

Or, you can define a longer function to catch exceptions if needed:

def date_converter(date_string):
    try:
        _date = datetime.strptime(date_string, '%Y-%b-%d')
        str_date = _date.strftime('%Y%m%d')
        return str_date
    except Exception:
        # Handle unparseable values however you need
        return ''

udf_function = f.udf(date_converter, t.StringType())

df = df.withColumn('date_col', udf_function(df.date_col))

Note: I am assuming `date_col` is the name of your column.
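
For example, a quick check of the fallback behaviour on a hypothetical two-row DataFrame, one valid date and one malformed string (`spark` is assumed to be an active SparkSession):

# Hypothetical data: one parseable date and one malformed value
demo = spark.createDataFrame([("2018-Jan-12",), ("not-a-date",)], ["date_col"])
demo.withColumn('date_col', udf_function(demo.date_col)).show()
#+--------+
#|date_col|
#+--------+
#|20180112|
#|        |
#+--------+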

Gocht

Investigate Python's datetime library and the methods strftime() and strptime(): Basic date and time types: strftime() and strptime().

For example, using strftime.org as a reference:

from datetime import datetime

date_string = '2018-Jan-12'
# Assuming day of the month is a zero-padded decimal number
datetime_object = datetime.strptime(date_string, '%Y-%b-%d')
converted_date_string = datetime_object.strftime('%Y%m%d')
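
To apply this to the Spark DataFrame from the question, the same parse-and-format logic can be wrapped in a UDF, much like the answer above. A sketch, assuming the column is named `date_str`:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Assumes df has a string column named date_str, e.g. '2018-Jan-12'
to_yyyymmdd = udf(lambda s: datetime.strptime(s, '%Y-%b-%d').strftime('%Y%m%d'), StringType())
df = df.withColumn('date_str', to_yyyymmdd('date_str'))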
Benjamin Scholtz