I am trying to compute age from the DOB column in my DataFrame (the column is a string in MM/DD/YYYY format):
from pyspark.sql.functions import to_date, datediff, floor, current_date
from pyspark.sql import functions as F
from pyspark.sql.functions import col
RawData_Combined = RawData_Combined.select(col("DOB"),to_date(col("DOB"),"MM-dd-yyyy").alias("DOBFINAL"))
RawData_Combined = RawData_Combined.withColumn('AgeDOBFinal', (F.months_between(current_date(), F.col('DOBFINAL')) / 12).cast('int'))
But when I run RawData_Combined.show(), both DOBFINAL and AgeDOBFinal are null for every row:
+----------+--------+-----------+
|       DOB|DOBFINAL|AgeDOBFinal|
+----------+--------+-----------+
| 4/17/1989|    null|       null|
| 3/16/1964|    null|       null|
|  1/1/1970|    null|       null|
| 3/30/1967|    null|       null|
|  2/1/1989|    null|       null|
|  1/1/1995|    null|       null|
|      null|    null|       null|
|  1/1/1976|    null|       null|
|      null|    null|       null|
|  1/1/1958|    null|       null|
|  1/1/1960|    null|       null|
|  1/1/1973|    null|       null|
| 5/18/1988|    null|       null|
|      null|    null|       null|
|  3/3/1980|    null|       null|
|  7/3/1988|    null|       null|
|  1/1/1997|    null|       null|
|  1/1/1961|    null|       null|
|10/16/1955|    null|       null|
|  5/5/1982|    null|       null|
+----------+--------+-----------+
only showing top 20 rows