How to read a CSV file with € and % symbols

Question

I'm reading a CSV file and creating a PySpark DataFrame. The columns TrueValue and PickoutValue contain "€" and "%" symbols. After reading, the € symbol shows up as "�".

Month       TrueValue           PickoutValue
1/1/2021    4728                52500
1/1/2021    4313                0
2/1/2021    3101                2500
2/1/2021    0                   0
3/1/2021    6.90%               6.60%
2/1/2021    75.60%              70.00%
3/1/2021    � 373,020,387.05    � 223,885,862.89

I need to create a new column "ResultValue" computed as (TrueValue / PickoutValue) * 100. This is what I tried:

from pyspark.sql import functions as F

df_src = spark.read.csv(src_path, header=True, encoding='ISO-8859-1')
# Strip "%", "€", "�" and spaces into helper columns, divide, then drop the helpers.
df = df_src.select(
        'Month',
        'TrueValue', F.translate(F.col('TrueValue'), "%\u20ac� ", "").alias('TrueValueReplaced'),
        'PickoutValue', F.translate(F.col('PickoutValue'), "%\u20ac� ", "").alias('PickoutValueReplaced')) \
    .withColumn('ResultValue', F.col('TrueValueReplaced') / F.col('PickoutValueReplaced') * 100) \
    .drop('TrueValueReplaced', 'PickoutValueReplaced')

But this does not remove the � symbol, and I'm not getting the desired DataFrame. Please advise on any other approaches.

Month       TrueValue           PickoutValue        TrueValueReplaced       PickoutValueReplaced        ResultValue
1/1/2021    4728                52500               4728                    52500                       9.005714285714287
1/1/2021    4313                0                   4313                    0                           null    
2/1/2021    3101                2500                3101                    2500                        124.03999999999999
2/1/2021    0                   0                   0                       0                           null
3/1/2021    6.90%               6.60%               6.90                    6.60                        104.54545454545456
2/1/2021    75.60%              70.00%              75.60                   70.00                       107.99999999999999
3/1/2021    373,020,387.05      223,885,862.89      373,020,387.05          223,885,862.89              null
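
The null in the last row is at least consistent with the thousands separators: a string such as 373,020,387.05 is not a valid number until the commas are removed, whatever happens to the currency symbol. A minimal plain-Python sketch of the cleaning and division follows. The helper names `to_number` and `result_value` are hypothetical, and returning None for a zero denominator is an assumption chosen to mirror the null rows in the table above.

```python
from typing import Optional


def to_number(raw: str) -> Optional[float]:
    """Parse values such as '4728', '6.90%' or '€ 373,020,387.05' into a float."""
    # Keep only ASCII digits, the decimal point and a minus sign. This drops
    # currency symbols, percent signs, spaces, thousands separators and any
    # invisible control character left behind by decoding with the wrong charset.
    cleaned = "".join(ch for ch in raw if ch in "0123456789.-")
    try:
        return float(cleaned)
    except ValueError:
        # Nothing numeric was left (for example an empty cell).
        return None


def result_value(true_value: str, pickout_value: str) -> Optional[float]:
    """Return (TrueValue / PickoutValue) * 100, or None when it is undefined."""
    numerator = to_number(true_value)
    denominator = to_number(pickout_value)
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator * 100


if __name__ == "__main__":
    rows = [
        ("4728", "52500"),
        ("4313", "0"),
        ("6.90%", "6.60%"),
        ("\x80 373,020,387.05", "\x80 223,885,862.89"),
    ]
    for true_value, pickout_value in rows:
        print(true_value.strip("\x80 "), pickout_value.strip("\x80 "),
              result_value(true_value, pickout_value))
```

In PySpark the same idea could probably be expressed as `F.regexp_replace(F.col('TrueValue'), '[^0-9.-]', '').cast('double')`, although that variant has not been run against the asker's data. Note that this treats a percentage such as 6.90% as the plain number 6.9, as the original attempt does.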

You are getting `�` because you opened the file with the wrong charset; it is probably ISO-8859-1 and you opened it as UTF-8. — CherryDT, Aug 30 '21 at 06:16
Can you print `TrueValueReplaced` and `PickoutValueReplaced` as well? — Robert Kossendey, Aug 30 '21 at 06:18
You read it just like any other file - making sure you use the correct charset/encoding. If the file uses Latin1/ISO-8859-1, it would be a good idea to change the exporting code to use UTF8. Otherwise specify `encoding='ISO-8859-1'`. I suspect your code worked up to now because Python 3 uses UTF8 and the files you imported used the 7-bit US-ASCII range instead of `Latin1` — Panagiotis Kanavos, Aug 30 '21 at 06:18
Does https://stackoverflow.com/questions/436220/how-to-determine-the-encoding-of-text help? How about https://realpython.com/python-encodings-guide/ ? — Karl Knechtel, Aug 30 '21 at 06:21
Hi all, please find the updated code in the question. I have added the encoding and the `TrueValueReplaced` and `PickoutValueReplaced` columns. — user175025, Aug 30 '21 at 06:33
You can't fix an encoding issue with `F.translate(F.col('TrueValue'),"%\u20ac� ","")`. `�` is the Unicode Replacement character, used when reading encoded text with the *wrong* encoding. This means the original data is lost. What if that was ¥ instead of € ? You have to use the correct `encoding`. Or change the writer to use UTF8 — Panagiotis Kanavos, Aug 30 '21 at 06:34
`Any other approaches pls advice..` there is no other. Find and use the correct encoding. — Panagiotis Kanavos, Aug 30 '21 at 06:36
Does this mean that the input file encoding must be changed? — user175025, Aug 30 '21 at 06:41
It means you need to know what the encoding is, and then use that encoding when opening the file. Perhaps see also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors — tripleee, Aug 30 '21 at 06:50
Does this answer your question? [How to process currency symbols in a csv file in pyspark dataframe](https://stackoverflow.com/questions/68983652/how-to-process-currency-symbols-in-a-csv-file-in-pyspark-dataframe) — Steven, Aug 30 '21 at 13:31
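
The comments above all come down to using the encoding the file was actually written in. The following sketch shows how one byte can decode to three different things. It assumes, without confirmation from the thread, that the file was exported as Windows-1252 (cp1252), where the euro sign is the single byte 0x80. ISO-8859-1 has no euro sign at all, so decoding that byte as ISO-8859-1 gives the invisible control character U+0080 rather than "€", which would explain why `F.translate` found nothing to remove.

```python
# The euro sign as a single byte, the way Windows-1252 (cp1252) stores it.
# Assumption: the CSV was exported as cp1252; the thread does not confirm this.
raw = "€ 373,020,387.05".encode("cp1252")

# Decoding as UTF-8: 0x80 is not valid there, so it becomes U+FFFD (the "�" symbol).
as_utf8 = raw.decode("utf-8", errors="replace")
# Decoding as ISO-8859-1: 0x80 maps to the invisible control character U+0080.
as_latin1 = raw.decode("iso-8859-1")
# Decoding as cp1252: 0x80 maps back to the euro sign.
as_cp1252 = raw.decode("cp1252")

for label, text in [("utf-8", as_utf8), ("iso-8859-1", as_latin1), ("cp1252", as_cp1252)]:
    # repr() makes the otherwise invisible characters visible as escapes.
    print(f"{label:<11} {text!r}")
```

If the file really is cp1252, passing `encoding='windows-1252'` to `spark.read.csv` instead of `'ISO-8859-1'` would be the thing to try. If the exporter used another charset (for example ISO-8859-15, which stores the euro sign as 0xA4), that name would be needed instead.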
