
I have a DataFrame with a Timestamp column, which I need to convert to Date format.

Are there any Spark SQL functions available for this?

Shankar

5 Answers


You can cast the column to date:

Scala:

import org.apache.spark.sql.types.DateType

val newDF = df.withColumn("dateColumn", df("timestampColumn").cast(DateType))
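Alternatively, Column.cast also accepts the type name as a string, which avoids the import (a minimal variant of the same cast):

val newDF = df.withColumn("dateColumn", df("timestampColumn").cast("date"))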

Pyspark:

df = df.withColumn('dateColumn', df['timestampColumn'].cast('date'))
Daniel de Paula
  • This isn't Spark SQL. – dslack Oct 12 '17 at 04:36
  • @dslack This solution uses functions that are part of the Spark SQL package, but it doesn't use the SQL language; it uses the DataFrame API, with SQL-like functions, rather than less reliable strings containing actual SQL queries. – Daniel de Paula Oct 12 '17 at 08:45
  • What is less reliable about SQL queries? – dslack Nov 05 '17 at 03:31
  • @dslack Well, it all depends on the application. In general, if your goal is to produce a reliable, testable, and stable codebase, query strings are not recommended: they are harder to change, easier to get subtly wrong, and less modular. – Daniel de Paula Nov 05 '17 at 08:56
  • Need help: I'm reading data from a database via JDBC. The Oracle table has 15-DEC-2016 in a field with the DATE datatype. DataFrame.printSchema() shows Timestamp, but when I print it, it shows all nulls. – Rudresh Ajgaonkar Jan 26 '18 at 15:09
  • @RudreshAjgaonkar One possible workaround is to cast the date to a string when reading from the source database, then convert to TimestampType in Spark with unix_timestamp or an equivalent strategy; see the sketch after this thread. – Daniel de Paula Jan 26 '18 at 15:16
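A minimal sketch of the workaround described in the last comment, assuming the JDBC query already aliases the Oracle date to a string column named date_str (the column name and date pattern are illustrative, not from the original thread):

import org.apache.spark.sql.functions.{col, unix_timestamp}

// Parse the string (e.g. "15-DEC-2016") into seconds since the epoch,
// then cast the result to a proper timestamp column.
val fixed = df.withColumn(
  "the_date",
  unix_timestamp(col("date_str"), "dd-MMM-yyyy").cast("timestamp"))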

In Spark SQL:

SELECT
  CAST(the_ts AS DATE) AS the_date
FROM the_table
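If your data lives in a DataFrame, you can register a temporary view and run the same query (a minimal sketch; the view and column names are taken from the query above):

df.createOrReplaceTempView("the_table")
val withDate = spark.sql("SELECT CAST(the_ts AS DATE) AS the_date FROM the_table")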
dslack

Imagine the following input:

import org.apache.spark.sql.functions.{current_timestamp, to_date}
import spark.implicits._

val dataIn = spark.createDataFrame(Seq(
        (1, "some data"),
        (2, "more data")))
    .toDF("id", "stuff")
    .withColumn("ts", current_timestamp())

dataIn.printSchema
root
 |-- id: integer (nullable = false)
 |-- stuff: string (nullable = true)
 |-- ts: timestamp (nullable = false)

You can use the to_date function:

val dataOut = dataIn.withColumn("date", to_date($"ts"))

dataOut.printSchema
root
 |-- id: integer (nullable = false)
 |-- stuff: string (nullable = true)
 |-- ts: timestamp (nullable = false)
 |-- date: date (nullable = false)

dataOut.show(false)
+---+---------+-----------------------+----------+
|id |stuff    |ts                     |date      |
+---+---------+-----------------------+----------+
|1  |some data|2017-11-21 16:37:15.828|2017-11-21|
|2  |more data|2017-11-21 16:37:15.828|2017-11-21|
+---+---------+-----------------------+----------+

I would recommend these methods over casting and plain SQL.

Marsellus Wallace

For Spark 2.4+:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DateType
import spark.implicits._

val newDF = df.withColumn("dateColumn", $"timestampColumn".cast(DateType))

OR

val newDF = df.withColumn("dateColumn", col("timestampColumn").cast(DateType))
skybutter

Tried and tested (note that withColumn returns a new DataFrame, so assign the result):

df_join_result = df_join_result.withColumn('order_date', df_join_result['order_date'].cast('date'))
Buddy