I'm reading a CSV file and creating a PySpark DataFrame. The columns TrueValue and PickoutValue contain "€" and "%" symbols. After reading, the € symbol shows up as "� ".
Month TrueValue PickoutValue
1/1/2021 4728 52500
1/1/2021 4313 0
2/1/2021 3101 2500
2/1/2021 0 0
3/1/2021 6.90% 6.60%
2/1/2021 75.60% 70.00%
3/1/2021 � 373,020,387.05 � 223,885,862.89
I need to create a new column "ResultValue", computed as (TrueValue / PickoutValue) * 100. This is what I tried:
from pyspark.sql import functions as F

df_src = spark.read.csv(src_path, header=True, encoding='ISO-8859-1')
# Strip "%", "€", "�" and spaces from both value columns, then compute the ratio.
df = df_src.select('Month',
    'TrueValue', F.translate(F.col('TrueValue'), "%\u20ac� ", "").alias('TrueValueReplaced'),
    'PickoutValue', F.translate(F.col('PickoutValue'), "%\u20ac� ", "").alias('PickoutValueReplaced')) \
    .withColumn('ResultValue', (F.col('TrueValueReplaced') / F.col('PickoutValueReplaced') * 100)) \
    .drop('TrueValueReplaced').drop('PickoutValueReplaced')
But this does not replace the � symbol, and I'm not getting the desired DataFrame. Can anyone advise on another approach? This is the output I get:
Month TrueValue PickoutValue TrueValueReplaced PickoutValueReplaced ResultValue
1/1/2021 4728 52500 4728 52500 9.005714285714287
1/1/2021 4313 0 4313 0 null
2/1/2021 3101 2500 3101 2500 124.03999999999999
2/1/2021 0 0 0 0 null
3/1/2021 6.90% 6.60% 6.90 6.60 104.54545454545456
2/1/2021 75.60% 70.00% 75.60 70.00 107.99999999999999
3/1/2021 373,020,387.05 223,885,862.89 373,020,387.05 223,885,862.89 null
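
One direction that might be worth trying: instead of listing every symbol to delete, keep only the characters that can be part of a number. The last row above stays null because the thousands separators (",") are still in the string, so it cannot be cast to a number. The plain-Python sketch below only illustrates that cleaning rule on a few of the sample rows; the helper names `to_number` and `result_value` are made up for this illustration. In PySpark the same rule could presumably be written as `F.regexp_replace(F.col('TrueValue'), r'[^0-9.\-]', '').cast('double')`, which I have not run against this data.

```python
import re

# Keep only digits, the decimal point and a minus sign. Everything else
# (currency symbols, "%", thousands separators, spaces, U+FFFD) is dropped.
NON_NUMERIC = re.compile(r"[^0-9.\-]")


def to_number(raw):
    """Return the numeric part of raw as a float, or None if it is not a number."""
    cleaned = NON_NUMERIC.sub("", raw)
    try:
        return float(cleaned)
    except ValueError:
        return None


def result_value(true_value, pickout_value):
    """Compute (TrueValue / PickoutValue) * 100 on the cleaned strings."""
    t = to_number(true_value)
    p = to_number(pickout_value)
    if t is None or p is None or p == 0:
        # Mirrors the null shown in the output above for a zero divisor.
        return None
    return t / p * 100


# A few rows from the sample data; "\ufffd" is the "�" replacement character.
rows = [
    ("4728", "52500"),
    ("4313", "0"),
    ("6.90%", "6.60%"),
    ("\ufffd 373,020,387.05", "\ufffd 223,885,862.89"),
]
results = [result_value(t, p) for t, p in rows]
for (t, p), r in zip(rows, results):
    print(t, p, r)
```

It may also be worth double-checking the file's real encoding: "€" is not part of ISO-8859-1, so the file was probably written with a different encoding (this is a guess, since the file itself is not shown).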