
My data frame looks like -

id            date             
1    2018-08-23 11:48:22       
2    2019-05-03 06:22:01       
3    2019-05-13 10:12:15       
4    2019-01-22 16:13:29       
5    2018-11-27 11:17:19   

My expected output is -

id            date             date1
1    2018-08-23 11:48:22       2018-08
2    2019-05-03 06:22:01       2019-05
3    2019-05-13 10:12:15       2019-05
4    2019-01-22 16:13:29       2019-01
5    2018-11-27 11:17:19       2018-11

How can I do this in PySpark?
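For reference, a minimal snippet to reproduce this sample data frame (assuming an active SparkSession named spark):

df = spark.createDataFrame(
    [(1, "2018-08-23 11:48:22"),
     (2, "2019-05-03 06:22:01"),
     (3, "2019-05-13 10:12:15"),
     (4, "2019-01-22 16:13:29"),
     (5, "2018-11-27 11:17:19")],
    ["id", "date"])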

  • Possible duplicate of [Convert pyspark string to date format](https://stackoverflow.com/questions/38080748/convert-pyspark-string-to-date-format) – pault Aug 12 '19 at 14:49

3 Answers


I think you are trying to drop the day and time details; you can use the date_format function for that:

>>> df.show()
+---+-------------------+
| id|               date|
+---+-------------------+
|  1|2018-08-23 11:48:22|
|  2|2019-05-03 06:22:01|
|  3|2019-05-13 10:12:15|
|  4|2019-01-22 16:13:29|
|  5|2018-11-27 11:17:19|
+---+-------------------+
>>> import pyspark.sql.functions as F
>>>
>>> df.withColumn('date1',F.date_format(F.to_date('date','yyyy-MM-dd HH:mm:ss'),'yyyy-MM')).show()
+---+-------------------+-------+
| id|               date|  date1|
+---+-------------------+-------+
|  1|2018-08-23 11:48:22|2018-08|
|  2|2019-05-03 06:22:01|2019-05|
|  3|2019-05-13 10:12:15|2019-05|
|  4|2019-01-22 16:13:29|2019-01|
|  5|2018-11-27 11:17:19|2018-11|
+---+-------------------+-------+
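Note: if the date column is already a string in the default yyyy-MM-dd HH:mm:ss format, date_format can arguably be applied to it directly, skipping the explicit to_date parse (a minimal sketch of the same idea):

>>> df.withColumn('date1', F.date_format('date', 'yyyy-MM')).show()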

Via the to_date and then substring functions, for example:

import pyspark.sql.functions as F

rawData = [(1, "2018-08-23 11:48:22"),
           (2, "2019-05-03 06:22:01"),
           (3, "2019-05-13 10:12:15")]

df = spark.createDataFrame(rawData).toDF("id", "my_date")

df.withColumn("new_my_date", F.substring(F.to_date(F.col("my_date")), 1, 7)).show()

+---+-------------------+-----------+
| id|            my_date|new_my_date|
+---+-------------------+-----------+
|  1|2018-08-23 11:48:22|    2018-08|
|  2|2019-05-03 06:22:01|    2019-05|
|  3|2019-05-13 10:12:15|    2019-05|
+---+-------------------+-----------+
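Since my_date here is already a string that begins with the year and month, the to_date step can arguably be skipped and substring applied to the raw column instead (a minimal sketch of that variation):

df.withColumn("new_my_date", F.substring(F.col("my_date"), 1, 7)).show()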

Alternatively, you can split the date string on '-' and concatenate the year and month parts back together:
import pyspark.sql.functions as F

split_col = F.split(df['date'], '-')
df = df.withColumn('year', split_col.getItem(0)).withColumn('month', split_col.getItem(1))
df = df.select(F.concat(df['year'], F.lit('-'),df['month']).alias('year_month'))
df.show()
+----------+
|year_month|
+----------+
|   2018-08|
|   2019-05|
|   2019-05|
|   2019-01|
|   2018-11|
+----------+
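Note that the select above keeps only the new column; a variant that keeps id and date alongside it could use withColumn with concat_ws instead (a sketch reusing split_col from above):

df = df.withColumn('year_month', F.concat_ws('-', split_col.getItem(0), split_col.getItem(1)))
df.show()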