Suppose I have a dataframe df with a column birth_date that holds values such as '123', '5345', '234345', etc. I first read the dataframe from a CSV using

df = sqlContext.read.csv('s3://path/to/file', header=True)

Every column is read as StringType(), so I first convert the birth_date column to LongType() (I need LongType for other reasons; I know IntegerType would also work, but let's not go into that right now) using the following:

from pyspark.sql.types import LongType
df = df.withColumn('birth_date', df['birth_date'].cast(LongType()))

Now, how do I convert the birth_date column to DateType, treating the integer values it holds as a number of days to add to the date "1960-01-01"?

I tried using the date_add method with the following command, but I am very new to PySpark and don't understand how column operations behave differently, so I am stuck.

Here is what I tried to do:

df= df.withColumn('birth_date',date_add("1960-01-01",'birth_date'))

and I am getting this error

py4j.Py4JException: Method date_add([class org.apache.spark.sql.Column, class java.lang.String]) does not exist

I am running everything in Databricks PySpark, if that matters at all.

Gompu
  • Possible duplicate of [Using a column value as a parameter to a spark DataFrame function](https://stackoverflow.com/questions/51140470/using-a-column-value-as-a-parameter-to-a-spark-dataframe-function) – pault Mar 22 '19 at 16:47
  • See the linked post for details, but you can use `pyspark.sql.functions.expr` here: `df= df.withColumn('birth_date',expr("date_add('1960-01-01',birth_date)"))` – pault Mar 22 '19 at 16:48

1 Answer

The problem is that the days argument of pyspark.sql.functions.date_add expects an integer, and you are passing it a column name. As noted in the comments, you can use pyspark.sql.functions.expr to pass a column instead (and also to use a string literal rather than a column for the start argument, as in your example):

from pyspark.sql.functions import expr
df= df.withColumn('birth_date', expr("date_add('1960-01-01', birth_date)"))
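
As a quick sanity check on what the expression above should return, the same day arithmetic can be sketched in plain Python with the standard library. This is only an illustration, not part of the Spark job: the helper name days_to_date is hypothetical, and the offsets are the sample values from the question.

```python
from datetime import date, timedelta

# Epoch taken from the question: an offset of 0 corresponds to 1960-01-01.
EPOCH = date(1960, 1, 1)


def days_to_date(days: int) -> date:
    """Hypothetical helper: add a day offset to the 1960-01-01 epoch."""
    return EPOCH + timedelta(days=days)


# Sample offsets from the question; print each offset with its resulting date.
for offset in (123, 5345, 234345):
    print(offset, days_to_date(offset).isoformat())
```

Comparing a few of these values against the converted birth_date column is a cheap way to confirm that the offsets were interpreted as days.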
user2739472