
I want to create a new data frame in PySpark by updating the data in a few columns of an old data frame.

I have the below data frame in parquet format, with columns uid, name, start_dt, addr, extid:

df = spark.read.parquet("s3a://testdata?src=ggl")
df1 = df.select("uid")

I have to create a new data frame in parquet with hashed uid and extid and include the remaining columns as well. Please suggest how to do this. I am new :(

Sample input:

uid, name, start_dt, addr, extid
1124569-2, abc, 12/02/2018, 343 Beach Dr Newyork NY, 889

Sample output:

uid, name, start_dt, addr, extid
a8ghshd345698cd, abc, 12/02/2018, 343 Beach Dr Newyork NY, shhj676ssdhghje

Here uid and extid are SHA-256 hashed.

Thanks in advance.


2 Answers


You can create a UDF that calls hashlib.sha256() on the column value and use withColumn to transform the column.

import pyspark.sql.functions as F
import pyspark.sql.types as T
import hashlib

df = spark.read.parquet("s3a://testdata?src=ggl")

# UDF that returns the hex-encoded SHA-256 digest of the column value
sha256_udf = F.udf(lambda x: hashlib.sha256(str(x).encode('utf-8')).hexdigest(), T.StringType())
df1 = df.withColumn('uid', sha256_udf('uid')).withColumn('extid', sha256_udf('extid'))
df1.show()
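
If you also need the result persisted as parquet (as the question mentions), a minimal follow-up would be to write df1 back out afterwards; the output path below is only a placeholder, not from the question:

# Write the hashed data frame back out as parquet (example path, adjust to your own bucket)
df1.write.mode("overwrite").parquet("s3a://testdata-hashed")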
Manoj Singh
  • I recommend you use Manoj's solution. Westside's solution is more specific, but Manoj's solution is more general. If you'd like to keep learning Spark, sooner or later you'll have to learn what UDFs are. They will help with a lot of issues. – jonathan Dec 11 '18 at 11:51
  • @jonathan sure, `udf`s are flexible and *sometimes* they are the only option but you will get [better performance](https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance?rq=1) if you can avoid them. – pault Dec 11 '18 at 16:53

PySpark already has a built-in function for generating SHA-256 hashes in the pyspark.sql.functions module.

Create Sample Data

from pyspark.sql import Row
from pyspark.sql.functions import sha2

df1 = spark.createDataFrame(
    [
        Row(
            uid="1124569-2",
            name="abc",
            start_dt="12/02/2018",
            addr="343 Beach Dr Newyork NY",
            extid="889"
        )
    ]
)
df1.show()
#+--------------------+-----+----+----------+---------+
#|                addr|extid|name|  start_dt|      uid|
#+--------------------+-----+----+----------+---------+
#|343 Beach Dr Newy...|  889| abc|12/02/2018|1124569-2|
#+--------------------+-----+----+----------+---------+

Hash selected columns:

df1.select(
    sha2(df1['uid'],256).alias('uid'),
    sha2(df1['extid'],256).alias('extid'),
    'addr',
    'name',
    'start_dt'
).show()
#+--------------------+--------------------+--------------------+----+----------+
#|                 uid|               extid|                addr|name|  start_dt|
#+--------------------+--------------------+--------------------+----+----------+
#|4629619cdf1cbeed6...|a829c72c42755e384...|343 Beach Dr Newy...| abc|12/02/2018|
#+--------------------+--------------------+--------------------+----+----------+

We don't have to create UDFs for that.
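
If you'd rather not list every column in the select, one variant (assuming the same built-in sha2) is to combine it with withColumn, which keeps the remaining columns as they are:

from pyspark.sql.functions import sha2

# Overwrite uid and extid with their SHA-256 hashes; all other columns pass through unchanged
df2 = df1.withColumn('uid', sha2('uid', 256)).withColumn('extid', sha2('extid', 256))
df2.show()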

frank