68

I am trying to convert a column that is in String format to Date format using the to_date function, but it returns null values.

df.createOrReplaceTempView("incidents")
spark.sql("select Date from incidents").show()

+----------+
|      Date|
+----------+
|08/26/2016|
|08/26/2016|
|08/26/2016|
|06/14/2016|

spark.sql("select to_date(Date) from incidents").show()

+---------------------------+
|to_date(CAST(Date AS DATE))|
+---------------------------+
|                       null|
|                       null|
|                       null|
|                       null|

The Date column is in String format:

 |-- Date: string (nullable = true)
Shaido
Ishan Kumar
  • Try this http://stackoverflow.com/questions/17134716/convert-dataframe-column-type-from-string-to-datetime – Piyush S. Wanare Nov 23 '16 at 12:02
  • Try this `raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')` – Piyush S. Wanare Nov 23 '16 at 12:03
  • Both of these comments point to answers using pandas data frames, not Spark data frames. While these data frame formats are interchangeable, conversion to pandas is costly on large data sets and negates many of the benefits that Spark provides (like being able to run a conversion on a distributed Spark cluster). – QA Collective Mar 08 '19 at 01:36

13 Answers

65

Use to_date with unix_timestamp and a Java SimpleDateFormat pattern:

TO_DATE(CAST(UNIX_TIMESTAMP(date, 'MM/dd/yyyy') AS TIMESTAMP))

Example:

spark.sql("""
  SELECT TO_DATE(CAST(UNIX_TIMESTAMP('08/26/2016', 'MM/dd/yyyy') AS TIMESTAMP)) AS newdate"""
).show()

+----------+
|   newdate|
+----------+
|2016-08-26|
+----------+
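
Applied to the incidents view from the question, this might look like the sketch below (assuming the same MM/dd/yyyy strings shown in the question):

spark.sql("""
  SELECT TO_DATE(CAST(UNIX_TIMESTAMP(Date, 'MM/dd/yyyy') AS TIMESTAMP)) AS Date
  FROM incidents
""").show()

+----------+
|      Date|
+----------+
|2016-08-26|
|2016-08-26|
|2016-08-26|
|2016-06-14|
+----------+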
MichaelChirico
  • I've found `to_date(my_string_column, 'yyyyMMdd')` AS my_date_column` to work just fine in `Spark 2.3.2`; of course you can substitute your own *date-format* in place of `yyyyMMdd` – y2k-shubham Feb 12 '19 at 12:13
  • The below link answer works https://forums.databricks.com/answers/12121/view.html df.withColumn("tx_date", to_date(unix_timestamp($"date", "M/dd/yyyy").cast("timestamp"))) – Binu Oct 22 '19 at 09:53
38

I solved the same problem without the temp table/view and with DataFrame functions.

Of course, I found that only one format works with this solution, and that's yyyy-MM-dd.

For example:

val df = sc.parallelize(Seq("2016-08-26")).toDF("Id")
val df2 = df.withColumn("Timestamp", (col("Id").cast("timestamp")))
val df3 = df2.withColumn("Date", (col("Id").cast("date")))

df3.printSchema

root
 |-- Id: string (nullable = true)
 |-- Timestamp: timestamp (nullable = true)
 |-- Date: date (nullable = true)

df3.show

+----------+--------------------+----------+
|        Id|           Timestamp|      Date|
+----------+--------------------+----------+
|2016-08-26|2016-08-26 00:00:...|2016-08-26|
+----------+--------------------+----------+

The timestamp of course has 00:00:00.0 as a time value.
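
To illustrate the limitation, here is a quick sketch: a string in any other format, for example the question's MM/dd/yyyy, simply casts to null.

import org.apache.spark.sql.functions.col

// the cast only succeeds for ISO yyyy-MM-dd strings; anything else becomes null
val bad = sc.parallelize(Seq("08/26/2016")).toDF("Id")
bad.withColumn("Date", col("Id").cast("date")).show()

+----------+----+
|        Id|Date|
+----------+----+
|08/26/2016|null|
+----------+----+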

V. Samma
25

Since your main aim was to convert the type of a column in a DataFrame from String to Date, I think this approach would be better.

import org.apache.spark.sql.functions.{to_date, to_timestamp}
val modifiedDF = DF.withColumn("Date", to_date($"Date", "MM/dd/yyyy"))

You could also use to_timestamp (available since Spark 2.2) if you require a fine-grained timestamp.
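
A minimal sketch of to_timestamp, assuming a hypothetical string column EventTime holding values such as 08/26/2016 13:45:30 (the column name and pattern are illustrative, not from the question):

import org.apache.spark.sql.functions.to_timestamp

// parse a string column into a proper timestamp using an explicit pattern
val withTs = DF.withColumn("EventTime", to_timestamp($"EventTime", "MM/dd/yyyy HH:mm:ss"))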

Sai Kiriti Badam
8

You can also do it with this query:

sqlContext.sql("""
select from_unixtime(unix_timestamp('08/26/2016', 'MM/dd/yyyy'), 'yyyy:MM:dd') as new_format
""").show()

+----------+
|new_format|
+----------+
|2016:08:26|
+----------+

Curycu
5

You can also pass a date format:

df.withColumn("Date",to_date(unix_timestamp(df.col("your_date_column"), "your_date_format").cast("timestamp")))

For example:

import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq("06 Jul 2018")).toDF("dateCol")
df.withColumn("Date",to_date(unix_timestamp(df.col("dateCol"), "dd MMM yyyy").cast("timestamp")))
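
Showing the result as a quick check (a sketch; given the dd MMM yyyy pattern, the expected Date value is 2018-07-06):

df.withColumn("Date", to_date(unix_timestamp(df.col("dateCol"), "dd MMM yyyy").cast("timestamp"))).show()

+-----------+----------+
|    dateCol|      Date|
+-----------+----------+
|06 Jul 2018|2018-07-06|
+-----------+----------+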
Gabber
  • error: value to_date is not a member of org.apache.spark.sql.DataFrame : Kindly suggest – MapReddy Usthili Jul 16 '18 at 05:34
  • @MapReddy please import org.apache.spark.sql.functions._ – Gabber Jul 16 '18 at 09:47
  • @AmitDubey , I just need the timestamp with "yyyy-MM-dd HH" like this as hours how can do it ? If I do lit(to_timestamp(current_timestamp(), "yyyy-MM-dd HH"))) like this , it is giving like "2018-11-26 02:36:26" ... how can I do this in "yyyy-MM-dd HH" format? – BdEngineer Nov 26 '18 at 07:40
3

I have personally found errors when using unix_timestamp-based date conversions from the dd-MMM-yyyy format to yyyy-MM-dd on Spark 1.6, but these may extend into more recent versions. Below I explain a way to solve the problem using java.time that should work in all versions of Spark:

I've seen errors when doing:

from_unixtime(unix_timestamp(StockMarketClosingDate, 'dd-MMM-yyyy'), 'yyyy-MM-dd') as FormattedDate

Below is code to illustrate the error, and my solution to fix it. First I read in stock market data, in a common standard file format:

    import sys.process._
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.udf
    import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DateType}
    import sqlContext.implicits._

    val EODSchema = StructType(Array(
        StructField("Symbol"                , StringType, true),     //$1       
        StructField("Date"                  , StringType, true),     //$2       
        StructField("Open"                  , StringType, true),     //$3       
        StructField("High"                  , StringType, true),     //$4
        StructField("Low"                   , StringType, true),     //$5
        StructField("Close"                 , StringType, true),     //$6
        StructField("Volume"                , StringType, true)      //$7
        ))

    val textFileName = "/user/feeds/eoddata/INDEX/INDEX_19*.csv"

    // below is code to read using later versions of spark
    //val eoddata = spark.read.format("csv").option("sep", ",").schema(EODSchema).option("header", "true").load(textFileName)


    // here is code to read using 1.6, via, "com.databricks:spark-csv_2.10:1.2.0"

    val eoddata = sqlContext.read
                               .format("com.databricks.spark.csv")
                               .option("header", "true")                               // Use first line of all files as header
                               .option("delimiter", ",")                               //.option("dateFormat", "dd-MMM-yyyy") failed to work
                               .schema(EODSchema)
                               .load(textFileName)

    eoddata.registerTempTable("eoddata")

And here are the date conversions that have issues:

%sql 
-- notice there are errors around the turn of the year
Select 
    e.Date as StringDate
,   cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'YYYY-MM-dd') as Date)  as ProperDate
,   e.Close
from eoddata e
where e.Symbol = 'SPX.IDX'
order by cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'YYYY-MM-dd') as Date)
limit 1000

A chart made in Zeppelin shows spikes around the turn of each year; these spikes are the date conversion errors.

and here is the check that shows the date conversion errors:

// shows the unix_timestamp conversion approach can create errors
val result =  sqlContext.sql("""
Select errors.* from
(
    Select 
    t.*
    , substring(t.OriginalStringDate, 8, 11) as String_Year_yyyy 
    , substring(t.ConvertedCloseDate, 0, 4)  as Converted_Date_Year_yyyy
    from
    (        Select
                Symbol
            ,   cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'YYYY-MM-dd') as Date)  as ConvertedCloseDate
            ,   e.Date as OriginalStringDate
            ,   Close
            from eoddata e
            where e.Symbol = 'SPX.IDX'
    ) t 
) errors
where String_Year_yyyy <> Converted_Date_Year_yyyy
""")


//df.withColumn("tx_date", to_date(unix_timestamp($"date", "M/dd/yyyy").cast("timestamp")))


result.registerTempTable("SPX")
result.cache()
result.show(100)
result: org.apache.spark.sql.DataFrame = [Symbol: string, ConvertedCloseDate: date, OriginalStringDate: string, Close: string, String_Year_yyyy: string, Converted_Date_Year_yyyy: string]
res53: result.type = [Symbol: string, ConvertedCloseDate: date, OriginalStringDate: string, Close: string, String_Year_yyyy: string, Converted_Date_Year_yyyy: string]
+-------+------------------+------------------+-------+----------------+------------------------+
| Symbol|ConvertedCloseDate|OriginalStringDate|  Close|String_Year_yyyy|Converted_Date_Year_yyyy|
+-------+------------------+------------------+-------+----------------+------------------------+
|SPX.IDX|        1997-12-30|       30-Dec-1996| 753.85|            1996|                    1997|
|SPX.IDX|        1997-12-31|       31-Dec-1996| 740.74|            1996|                    1997|
|SPX.IDX|        1998-12-29|       29-Dec-1997| 953.36|            1997|                    1998|
|SPX.IDX|        1998-12-30|       30-Dec-1997| 970.84|            1997|                    1998|
|SPX.IDX|        1998-12-31|       31-Dec-1997| 970.43|            1997|                    1998|
|SPX.IDX|        1998-01-01|       01-Jan-1999|1229.23|            1999|                    1998|
+-------+------------------+------------------+-------+----------------+------------------------+
FINISHED   

After this result, I switched to java.time conversions with a UDF like this, which worked for me:

// now we will create a UDF that uses the very nice java.time library to properly convert the silly stockmarket dates
// start by importing the specific java.time classes that superseded the joda.time ones
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// now define the specific date conversion function we want

def fromEODDate (YourStringDate: String): String = {

    val formatter = DateTimeFormatter.ofPattern("dd-MMM-yyyy")
    val retDate = LocalDate.parse(YourStringDate, formatter)

    // this parses the silly dd-MMM-yyyy format into a proper LocalDate
    // now we format that date to the desired yyyy-MM-dd string

    val retStringDate = retDate.format(DateTimeFormatter.ISO_LOCAL_DATE)
    retStringDate
}

Now I register it as a function for use in sql:

sqlContext.udf.register("fromEODDate", fromEODDate(_:String))

and check the results by rerunning the query:

val results = sqlContext.sql("""
    Select
        e.Symbol    as Symbol
    ,   e.Date      as OrigStringDate
    ,   Cast(fromEODDate(e.Date) as Date) as ConvertedDate
    ,   e.Open
    ,   e.High
    ,   e.Low
    ,   e.Close
    from eoddata e
    order by Cast(fromEODDate(e.Date) as Date)
""")

results.printSchema()
results.cache()
results.registerTempTable("results")
results.show(10)
results: org.apache.spark.sql.DataFrame = [Symbol: string, OrigStringDate: string, ConvertedDate: date, Open: string, High: string, Low: string, Close: string]
root
 |-- Symbol: string (nullable = true)
 |-- OrigStringDate: string (nullable = true)
 |-- ConvertedDate: date (nullable = true)
 |-- Open: string (nullable = true)
 |-- High: string (nullable = true)
 |-- Low: string (nullable = true)
 |-- Close: string (nullable = true)
res79: results.type = [Symbol: string, OrigStringDate: string, ConvertedDate: date, Open: string, High: string, Low: string, Close: string]
+--------+--------------+-------------+-------+-------+-------+-------+
|  Symbol|OrigStringDate|ConvertedDate|   Open|   High|    Low|  Close|
+--------+--------------+-------------+-------+-------+-------+-------+
|ADVA.IDX|   01-Jan-1996|   1996-01-01|    364|    364|    364|    364|
|ADVN.IDX|   01-Jan-1996|   1996-01-01|   1527|   1527|   1527|   1527|
|ADVQ.IDX|   01-Jan-1996|   1996-01-01|   1283|   1283|   1283|   1283|
|BANK.IDX|   01-Jan-1996|   1996-01-01|1009.41|1009.41|1009.41|1009.41|
| BKX.IDX|   01-Jan-1996|   1996-01-01|  39.39|  39.39|  39.39|  39.39|
|COMP.IDX|   01-Jan-1996|   1996-01-01|1052.13|1052.13|1052.13|1052.13|
| CPR.IDX|   01-Jan-1996|   1996-01-01|  1.261|  1.261|  1.261|  1.261|
|DECA.IDX|   01-Jan-1996|   1996-01-01|    205|    205|    205|    205|
|DECN.IDX|   01-Jan-1996|   1996-01-01|    825|    825|    825|    825|
|DECQ.IDX|   01-Jan-1996|   1996-01-01|    754|    754|    754|    754|
+--------+--------------+-------------+-------+-------+-------+-------+
only showing top 10 rows

which looks ok, and when I rerun my chart there are no more spikes or errors.

I now use a UDF as shown above to apply my date format transformations to a standard yyyy-MM-dd format, and have not had spurious errors since. :-)

Minkymorgan
3

You could simply do df.withColumn("date", date_format(col("string"),"yyyy-MM-dd HH:mm:ss.SSSSSS")).show()

ss301
1

dateid is an int column that contains the date as an integer in yyyyMMdd form:

spark.sql("SELECT from_unixtime(unix_timestamp(cast(dateid as string), 'yyyyMMdd'), 'yyyy-MM-dd') from XYZ").show(50, false)
Prasad Khode
Rajiv Singh
1

Find the below-mentioned code; it might be helpful for you.

val stringDate = spark.sparkContext.parallelize(Seq("12/16/2019")).toDF("StringDate")
val dateConversion = stringDate.withColumn("dateColumn", to_date(unix_timestamp($"StringDate", "MM/dd/yyyy").cast("Timestamp")))
dateConversion.show(false)

+----------+----------+
|StringDate|dateColumn|
+----------+----------+
|12/16/2019|2019-12-16|
+----------+----------+
Sai Mammahi
1

This works in Spark SQL:
TO_DATE(date_string_or_column, 'yyyy-MM-dd') AS date_column_name. You can replace the second argument with whatever format your date string is in, e.g. yyyy/MM/dd. The return type is date.
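
For the question's MM/dd/yyyy strings, a quick sketch (assuming the incidents view from the question; the two-argument TO_DATE is available from Spark 2.2 onwards):

spark.sql("SELECT TO_DATE(Date, 'MM/dd/yyyy') AS Date FROM incidents").show()

This should return proper date values such as 2016-08-26 instead of null.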

Shailesh
0

Use the function below in PySpark to cast columns to the datatype you need. Here I'm casting every date column to a timestamp column.

from pyspark.sql.functions import col

def change_dtype(df):
    # cast every date-typed column to timestamp
    for name, dtype in df.dtypes:
        if dtype == "date":
            df = df.withColumn(name, col(name).cast('timestamp'))
    return df
0

When you try to convert a string column to date format, and the string data is in a format like 'dd/MM/yyyy' (with slashes) on a Spark version greater than 3.0, the conversion returns null.

For this to work you can set the following Spark configuration property, which falls back to the legacy date parser:

spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")

and then we can use the code below to get the output that we want:

df.withColumn("tx_date", to_date(unix_timestamp($"date", "dd/MM/yyyy").cast("timestamp")))
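
Putting it together, a minimal sketch (assuming a DataFrame df with a string column date in dd/MM/yyyy format):

import org.apache.spark.sql.functions.{col, to_date, unix_timestamp}

// fall back to the pre-3.0 SimpleDateFormat-style parser
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

val converted = df.withColumn("tx_date",
  to_date(unix_timestamp(col("date"), "dd/MM/yyyy").cast("timestamp")))
converted.show()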
Nikunj Kakadiya
-1

The solution proposed above by Sai Kiriti Badam worked for me.

I'm using Azure Databricks to read data captured from an EventHub. This contains a string column named EnqueuedTimeUtc with the following format...

12/7/2018 12:54:13 PM

I'm using a Python notebook and used the following...

import pyspark.sql.functions as func

sports_messages = sports_df.withColumn("EnqueuedTimestamp", func.to_timestamp("EnqueuedTimeUtc", "MM/dd/yyyy hh:mm:ss aaa"))

... to create a new column EnqueuedTimestamp of type "timestamp" with data in the following format...

2018-12-07 12:54:13

Simon Peacock