2

I am calculating age from birth date in pyspark :

def run(first): 

    out = spark.sql("""
        SELECT 
           p.birth_date,
        FROM table1 p
        LEFT JOIN table2 a USING(id)
        LEFT JOIN table2 m ON m.id = p.id
        LEFT JOIN table4 i USING(id))"""
    
    out = out.withColumn('month', F.lit(first))

    out = out.withColumn('age',
             F.when(F.col('birth_date').isNull(), None).otherwise(
               F.floor(F.datediff(
                 F.col('month'), F.col('birth_date'))/365.25)))

I get the following error at this line:

F.col('month'), F.col('birth_date'))/365.25)))

TypeError: unsupported operand type(s) for -: 'DataFrame' and 'DataFrame'

Any ideas on how to resolve this ?

CyberPunk
  • 1,297
  • 5
  • 18
  • 35
  • could you paste your full stacktrace and code snippets, i dont think this is wrong – E.ZY. Nov 04 '20 at 06:11
  • @E.ZY. I. have updated my question with the rest of the code. – CyberPunk Nov 04 '20 at 08:12
  • Does this answer your question? [Age from birthdate in python](https://stackoverflow.com/questions/2217488/age-from-birthdate-in-python) – pradeexsu Nov 04 '20 at 08:13
  • does your first variable a df or int? but your sql might be wrong, idk is when you paste the code, the ) come before """ – E.ZY. Nov 05 '20 at 06:17

1 Answers1

0

The problem is likely due to mixing of datatypes. Since I am not sure what data types your columns are, here is a solution with both TimestampType and DateType columns:

from pyspark.sql import functions as F
from pyspark.sql import types as T

df = spark.createDataFrame(
    data=[
        (1, "foo", datetime.strptime("1999-12-19", "%Y-%m-%d"), datetime.strptime("1999-12-19", "%Y-%m-%d").date()),
        (2, "bar", datetime.strptime("1989-12-14", "%Y-%m-%d"), datetime.strptime("1989-12-14", "%Y-%m-%d").date()),
    ],
    schema=T.StructType([
        T.StructField("id", T.IntegerType(), True),
        T.StructField("name", T.StringType(), True),
        T.StructField("birth_ts", T.TimestampType(),True),
        T.StructField("birth_date", T.DateType(), True)
      ])
)

df = df.withColumn("age_ts", F.floor(F.datediff(F.current_timestamp(), F.col("birth_ts"))/365.25))
df = df.withColumn("age_date", F.floor(F.datediff(F.current_date(), F.col("birth_date"))/365.25))
df.show()
id name birth_ts birth_date age_ts age_date
1 foo 1999-12-19 00:00:00 1999-12-19 22 22
2 bar 1989-12-14 00:00:00 1989-12-14 32 32
dsalaj
  • 2,857
  • 4
  • 34
  • 43