I have a column with values like 'Jan 2018', 'Mar 2019', 'Dec 2016'. I want to convert it to a date type in the format MMM yyyy. When I do this in PySpark, the resulting DataFrame also includes the day, as in (2018,1,1). How can I get rid of the day?

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

conf = SparkConf().setMaster("local").setAppName("Date")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

df = spark.createDataFrame([('Jan 2018',)], ['Month_Year'])
# Parse the 'MMM yyyy' string into a date column and collect the rows
df1 = df.select(to_date(df.Month_Year, 'MMM yyyy').alias('dt')).collect()

print(df1)

Output: [Row(dt=datetime.date(2018, 1, 1))]

My expected output is (2018,1) or (Jan 2018) or (1,2018), i.e. only the month and year.

Rebecca

1 Answer

The to_date function converts string, timestamp, and date columns to a date, which always includes a day and is displayed in yyyy-MM-dd format.

To get your expected result, use the date_format() function to specify the output format. Note that it returns a string column, not a date.

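from pyspark.sql.functions import date_format, to_date

# Year and zero-padded month, comma-separated
print(df.select(date_format(to_date(df.Month_Year, 'MMM yyyy'), "yyyy,MM").alias('dt')).collect())
#[Row(dt=u'2018,01')]

# Month without padding, then year
print(df.select(date_format(to_date(df.Month_Year, 'MMM yyyy'), "M,yyyy").alias('dt')).collect())
#[Row(dt=u'1,2018')]

# Abbreviated month name and year, same as the input
print(df.select(date_format(to_date(df.Month_Year, 'MMM yyyy'), "MMM yyyy").alias('dt')).collect())
#[Row(dt=u'Jan 2018')]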
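
If the rows are collected to the driver anyway, the month and year can also be read off the resulting datetime.date values in plain Python. The following is only a sketch using the standard library, not Spark: the variable names are made up for illustration, and strptime stands in for the date that Spark returned in the question. It assumes the default (English) locale for the month abbreviation.

```python
from datetime import datetime

# Parse a month-year string; strptime fills in the missing day with 1,
# which mirrors the datetime.date(2018, 1, 1) seen in the question.
parsed = datetime.strptime("Jan 2018", "%b %Y").date()

# A date object always carries a day, so month-year output
# has to be derived from it rather than stored in it.
as_tuple = (parsed.year, parsed.month)  # integers: year and month only
as_text = parsed.strftime("%b %Y")      # string: abbreviated month and year

print(parsed, as_tuple, as_text)
```

Either way, the month-year value is no longer a date: it is a tuple of integers or a string.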
notNull