I want to create a new data frame in PySpark by updating the data in a few columns of an old data frame.
I have the data frame below, read from Parquet, with the columns uid, name, start_dt, addr, and extid:
df = spark.read.parquet("s3a://testdata?src=ggl")
df1 = df.select("uid")  # this only keeps the uid column, but I need all columns
I need to create a new data frame, written out as Parquet, where uid and extid are hashed and all the remaining columns are kept as-is. How can I do this? I am new to PySpark :(
Sample input:
uid, name, start_dt, addr, extid
1124569-2, abc, 12/02/2018, 343 Beach Dr Newyork NY, 889
Sample output:
uid, name, start_dt, addr, extid
a8ghshd345698cd, abc, 12/02/2018, 343 Beach Dr Newyork NY, shhj676ssdhghje
Here, uid and extid are SHA-256 hashed.
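
From the docs, I think pyspark.sql.functions.sha2 might be what I need. Here is a rough sketch of what I'm imagining (the output path is just a placeholder), but I'm not sure this is the right approach:

from pyspark.sql import functions as F

df = spark.read.parquet("s3a://testdata?src=ggl")

# Replace uid and extid with their SHA-256 hashes (as hex strings);
# all the other columns pass through unchanged.
df_hashed = df.withColumn("uid", F.sha2(F.col("uid").cast("string"), 256)) \
              .withColumn("extid", F.sha2(F.col("extid").cast("string"), 256))

# Write the result back out as Parquet (placeholder path).
df_hashed.write.parquet("s3a://testdata-hashed")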
Thanks in advance.