
I have very little experience with PySpark and I have been trying, with no success, to create 3 new columns from a column that contains the timestamp of each row.

The column containing the date has the following format: EEE MMM dd HH:mm:ss Z yyyy. So it looks like this:

+--------------------+
|           timestamp|
+--------------------+
|Fri Oct 18 17:07:...|
|Mon Oct 21 21:49:...|
|Thu Oct 31 18:03:...|
|Sun Oct 20 15:00:...|
|Mon Sep 30 23:35:...|
+--------------------+

The 3 columns have to contain: the day of the week as an integer (so 0 for Monday, 1 for Tuesday, ...), the number of the month, and the year. What is the most effective way to create these 3 additional columns and append them to the PySpark DataFrame? Thanks in advance!

Lamanus
    what have you tried? please read [ask] – mck Jun 18 '21 at 13:01
  • please read as well: [how-to-make-good-reproducible-apache-spark-examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples) – Hansanho Jun 18 '21 at 17:33

1 Answer


Spark 1.5 and higher has many date-processing functions. Here are some that may be useful for you:

from pyspark.sql.functions import col, dayofweek, month, year

# these functions expect a date/timestamp column, not a raw string
df = df.withColumn('dayOfWeek', dayofweek(col('your_date_column')))  # 1 = Sunday ... 7 = Saturday
df = df.withColumn('month', month(col('your_date_column')))
df = df.withColumn('year', year(col('your_date_column')))
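
Since the timestamp column in the question is a string in the format EEE MMM dd HH:mm:ss Z yyyy, it needs to be parsed into an actual timestamp first, otherwise the functions above return null. Below is a minimal sketch of the full flow, assuming the string column is named timestamp and the SparkSession is available as spark; on Spark 3+ the 'E' pattern letter is not allowed when parsing, so the legacy parser policy is enabled here (Spark 2.x does not need that line). The dayofweek result is also shifted so that Monday maps to 0, as asked.

from pyspark.sql.functions import col, to_timestamp, dayofweek, month, year

# Spark 3+ rejects the 'E' pattern when parsing, so fall back to the legacy parser
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

# parse the string column into a proper timestamp
df = df.withColumn("ts", to_timestamp(col("timestamp"), "EEE MMM dd HH:mm:ss Z yyyy"))

# dayofweek returns 1 (Sunday) .. 7 (Saturday); shift so that Monday = 0, ..., Sunday = 6
df = (df
      .withColumn("dayOfWeek", (dayofweek(col("ts")) + 5) % 7)
      .withColumn("month", month(col("ts")))
      .withColumn("year", year(col("ts"))))

On recent Spark versions the same Monday-0 numbering is also available directly via the SQL weekday function, e.g. expr("weekday(ts)").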
XXavier