7

I have a dataframe with some columns:

+------------+--------+----------+----------+
|country_name| ID_user|birth_date|      psdt|
+------------+--------+----------+----------+
|      Россия|16460783|       486|1970-01-01|
|      Россия|16467391|      4669|1970-01-01|
|      Россия|16467889|      6861|1970-01-01|
|   Казахстан|16468013|      5360|1970-01-01|
|      Россия|16471027|      6311|1970-01-01|
|      Россия|16474162|      5567|1970-01-01|
|      Россия|16476386|      4351|1970-01-01|
|      Россия|16481067|      3831|1970-01-01|
|   Казахстан|16485965|     -2369|1970-01-01|
|    Германия|16486027|      5864|1970-01-01|
+------------+--------+----------+----------+
only showing top 10 rows

I need to add "psdt" with "birth_date". I wrote this code, but (sf.date_add) doesn't work:

resultbirthDF =(
        resultDF
        .select(sf.col("country_name"),
                sf.col("ID_user"),
                sf.col("birth_date"),
                sf.lit(past_datetr).alias("psdt")
               )
        .withColumn("birth_datetrue",sf.date_add(sf.to_date(sf.col("psdt")),sf.col("birth_date")))
    ).show(10)

'Column' object is not callable
Traceback (most recent call last):
  File "/volumes/disk1/yarn/local/usercache/livy/appcache/application_1573843665329_0786/container_e05_1573843665329_0786_01_000001/pyspark.zip/pyspark/sql/functions.py", line 1006, in date_add
    return Column(sc._jvm.functions.date_add(_to_java_column(start), days))

How to solve this problem?

Jacek Laskowski
  • 72,696
  • 27
  • 242
  • 420
  • What is the error you are getting and what have you done to solve this problem? – Simas Joneliunas Nov 26 '19 at 01:34
  • Does this answer your question? [TypeError: 'Column' object is not callable using WithColumn](https://stackoverflow.com/questions/48305443/typeerror-column-object-is-not-callable-using-withcolumn) – vs97 Nov 26 '19 at 01:36
  • What is `sf`? dataframe? then why are you calling `sf.date_add` and `sf.to_date`? – Lamanus Nov 26 '19 at 04:56
  • @Lamanus `sf` is `pyspark.sql.functions` – pissall Nov 26 '19 at 08:15
  • @vs97 This is not at all a duplicate of the mentioned question – pissall Nov 26 '19 at 08:16
  • Possible duplicate of [Using a column value as a parameter to a spark DataFrame function](https://stackoverflow.com/questions/51140470/using-a-column-value-as-a-parameter-to-a-spark-dataframe-function) – pault Nov 26 '19 at 20:20

1 Answers1

12

From Pyspark Documentation

pyspark.sql.functions.date_add(start, days)

Returns the date that is days days after start

>>> df = spark.createDataFrame([('2015-04-08',)], ['dt'])
>>> df.select(date_add(df.dt, 1).alias('next_date')).collect()
[Row(next_date=datetime.date(2015, 4, 9))]

Try changing your code to sf.date_add(sf.to_date(sf.col("psdt")), 10) and see if 10 days get added. date_add expects the first argument to be a column and the second argument to be an integer (for the number of days you want to add to the column).

You can do exactly what you want to do without a UDF, but using a SQL expression as follows:

df.withColumn("birth_datetrue", expr("date_add(psdt, birth_date)"))

Sample output:

df.withColumn("new_dt", F.expr("date_add(dt, add)")).show()

+----------+---+----------+
|        dt|add|    new_dt|
+----------+---+----------+
|2015-04-08|  1|2015-04-09|
|2015-04-09|  2|2015-04-11|
|2015-04-10|  3|2015-04-13|
+----------+---+----------+
pissall
  • 7,109
  • 2
  • 25
  • 45