
I tried to convert a date in the format 2018-07-12 to ddMMyy using to_date, but I get null after the conversion:

from pyspark.sql.functions import to_date

df = spark.createDataFrame([('2018-07-12',)], ['Date_col'])

df = df.withColumn('new_date', to_date('Date_col', 'ddMMyy'))

I need to use this logic to convert the DataFrame column. I am new to Spark programming and have tried a lot of solutions, but nothing helps.

I also need to concatenate the ddMMyy from one column with the hhss from another column.

Any help please?

user10813834
  • can you please specify the final output you are looking for? – Nadim Younes Jan 14 '20 at 13:52
  • the final output should be 120718 in the above case – user10813834 Jan 14 '20 at 13:59
  • Possible duplicate of [Pyspark changing type of column from date to string](https://stackoverflow.com/questions/32977360/pyspark-changing-type-of-column-from-date-to-string) and [pyspark convert dataframe column from timestamp to string of “YYYY-MM-DD” format](https://stackoverflow.com/questions/48910511/pyspark-convert-dataframe-column-from-timestamp-to-string-of-yyyy-mm-dd-format). – pault Jan 14 '20 at 14:53
  • Also based on the sentence *I need to concat the ddMMyy from one column and hhss from other column*, it seems that this could be an [XY Problem](http://www.xyproblem.info). There may be a better solution but it's hard to tell without seeing a [mcve]. – pault Jan 14 '20 at 14:58

2 Answers


You can use the built-in pyspark.sql.functions.date_format(date, format) function:

from pyspark.sql.functions import date_format

df = spark.createDataFrame([('2018-07-12',)], ['dt'])
df.select(date_format('dt', 'ddMMyy').alias('date')).collect()

Results:

[Row(date='120718')]

Reference: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dateformat#pyspark.sql.functions.date_format
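
Since the question also asks about concatenating the ddMMyy from one column with the hhss from another, date_format can be combined with concat. A minimal sketch, assuming a hypothetical Time_col holding a timestamp string and taking the 'hhss' pattern literally (swap in 'HHmm' or similar if that was the intent):

from pyspark.sql.functions import concat, date_format

df2 = spark.createDataFrame(
    [('2018-07-12', '2018-07-12 09:30:45')],
    ['Date_col', 'Time_col']  # Time_col is an assumed column name
)

# Format each column separately, then concatenate the two strings
df2 = df2.withColumn(
    'combined',
    concat(date_format('Date_col', 'ddMMyy'),
           date_format('Time_col', 'hhss'))
)
df2.show()
+----------+-------------------+----------+
|  Date_col|           Time_col|  combined|
+----------+-------------------+----------+
|2018-07-12|2018-07-12 09:30:45|1207180945|
+----------+-------------------+----------+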

Foxan Ng

First of all, let's create the DataFrame:

df = spark.createDataFrame([('2018-07-12',)], ['Date_col'])
df.show()
+----------+
|  Date_col|
+----------+
|2018-07-12|
+----------+

Then we define a UDF for the conversion.

from datetime import datetime
import pyspark.sql.types as T
import pyspark.sql.functions as F


def user_defined_timestamp(date_col):
    # Parse the yyyy-MM-dd string, then re-format it as ddMMyy
    _date = datetime.strptime(date_col, '%Y-%m-%d')
    return _date.strftime('%d%m%y')

user_defined_timestamp_udf = F.udf(user_defined_timestamp, T.StringType())

And at the end we apply our function to the DataFrame in order to create the column we want.

df = df.withColumn('new_date', user_defined_timestamp_udf('Date_col'))
df.show()
+----------+--------+
|  Date_col|new_date|
+----------+--------+
|2018-07-12|  120718|
+----------+--------+
lukaszKielar
  • Thanks for this, I will try it. Just a question: why do we actually define a UDF function for the conversion of the date? Do I need to define a function whenever I convert a date? – user10813834 Jan 14 '20 at 14:42
  • You are able to create any function you want using a UDF. I don't know if you can use a custom timestamp format while using the `to_date` PySpark function. – lukaszKielar Jan 14 '20 at 14:53
  • [Do not use a `udf` for this](https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance). You can use [`date_format`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.date_format) as shown in [this post](https://stackoverflow.com/questions/32977360/pyspark-changing-type-of-column-from-date-to-string). – pault Jan 14 '20 at 14:55
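
For reference, the UDF-free version that pault's comment points to, a minimal sketch using the built-in date_format (it produces the same output as the UDF above):

import pyspark.sql.functions as F

df = spark.createDataFrame([('2018-07-12',)], ['Date_col'])

# date_format formats a date/timestamp column as a string, no UDF needed
df = df.withColumn('new_date', F.date_format('Date_col', 'ddMMyy'))
df.show()
+----------+--------+
|  Date_col|new_date|
+----------+--------+
|2018-07-12|  120718|
+----------+--------+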