
I am trying to extract Age from the DOB column in my DataFrame (in MM/DD/YYYY format, datatype string):

from pyspark.sql import functions as F
from pyspark.sql.functions import col, to_date, current_date

RawData_Combined = RawData_Combined.select(col("DOB"), to_date(col("DOB"), "MM-dd-yyyy").alias("DOBFINAL"))

RawData_Combined = RawData_Combined.withColumn('AgeDOBFinal', (F.months_between(current_date(), F.col('DOBFINAL')) / 12).cast('int'))

but when I run RawData_Combined.show(), it gives the output below:

+----------+--------+-----------+
|       DOB|DOBFINAL|AgeDOBFinal|
+----------+--------+-----------+
| 4/17/1989|    null|       null|
| 3/16/1964|    null|       null|
|  1/1/1970|    null|       null|
| 3/30/1967|    null|       null|
|  2/1/1989|    null|       null|
|  1/1/1995|    null|       null|
|      null|    null|       null|
|  1/1/1976|    null|       null|
|      null|    null|       null|
|  1/1/1958|    null|       null|
|  1/1/1960|    null|       null|
|  1/1/1973|    null|       null|
| 5/18/1988|    null|       null|
|      null|    null|       null|
|  3/3/1980|    null|       null|
|  7/3/1988|    null|       null|
|  1/1/1997|    null|       null|
|  1/1/1961|    null|       null|
|10/16/1955|    null|       null|
|  5/5/1982|    null|       null|
+----------+--------+-----------+
only showing top 20 rows

1 Answer

This would work (note the single-letter M and d, since the months and days in your data are not zero-padded):

df.withColumn("DOBFINAL", F.to_date(F.col("DOB"), "M/d/yyyy"))\
.withColumn('AgeDOBFinal', (F.months_between(F.current_date(), F.col('DOBFINAL')) / 12).cast('int'))\
.show()

Input:

+----------+
|       DOB|
+----------+
| 4/17/1989|
|10/16/1955|
+----------+

Output:

+----------+----------+-----------+
|       DOB|  DOBFINAL|AgeDOBFinal|
+----------+----------+-----------+
| 4/17/1989|1989-04-17|         33|
|10/16/1955|1955-10-16|         67|
+----------+----------+-----------+

Ref:

'M' or 'L': Month number in a year starting from 1. There is no difference between 'M' and 'L'. Month from 1 to 9 are printed without padding.

https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
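
For context, the nulls in the question come from a pattern mismatch: to_date is given "MM-dd-yyyy" (hyphens, zero-padded), but the data uses slashes with non-padded months and days. A minimal sketch to reproduce, assuming a local SparkSession (the toy DataFrame and alias names are illustrative):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("4/17/1989",), ("1/1/1970",)], ["DOB"])

df.select(
    F.col("DOB"),
    F.to_date(F.col("DOB"), "MM-dd-yyyy").alias("hyphen_pattern"),  # null: data uses '/'
    F.to_date(F.col("DOB"), "M/d/yyyy").alias("slash_pattern"),     # parses both rows
).show()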

  • It didn't work, it was showing a Py4JError, probably because I am using the latest version of Hadoop, so I tried the code below: from pyspark.sql import functions as F RawData_Combined = RawData_Combined.withColumn("date_in_dateFormat", F.to_date(F.to_timestamp(col("DOB"), "M/d/y"))) – RajatK350 Mar 21 '23 at 12:25
  • @RajatK350 It shouldn't be M/d/y, I think. Can you share the exact error? – Ronak Jain Mar 21 '23 at 12:37
  • Hi Ronak, sorry for the late reply. This is the error I was getting: SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '12/1/2010 8:26' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string. – RajatK350 Mar 22 '23 at 04:28
  • So I looked at https://stackoverflow.com/questions/62943941/to-date-fails-to-parse-date-in-spark-3-0, tried the code there, and the issue got resolved. – RajatK350 Mar 22 '23 at 04:29
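
For anyone hitting the same SparkUpgradeException: the error message itself names the two options. A sketch of both, assuming a SparkSession named spark and input strings like '12/1/2010 8:26' (the exact pattern below is an assumption based on that sample):

# Option 1: restore the pre-Spark-3.0 (SimpleDateFormat-based) parsing behavior.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

# Option 2: keep the new parser and match the data exactly, using single-letter
# fields (M, d, H) for non-padded values, per the linked question.
from pyspark.sql import functions as F
RawData_Combined = RawData_Combined.withColumn(
    "date_in_dateFormat",
    F.to_date(F.to_timestamp(F.col("DOB"), "M/d/yyyy H:mm")),
)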