
I have a question. I have a dataframe with a string variable that originally includes 'Ä', 'ö', 'ü', etc. I would like to replace these characters with Ae, oe, etc. A direct `regexp_replace` from ü to ue does not work, of course. When I do

`df.show()`

PySpark shows the respective characters everywhere as �. I googled a bit, and when I try to decode this column with

`decode(df.column, 'ISO-8859-1')` *or* `decode(df.column, 'ascii')`,

I can get rid of �; however, the return is always ¿½k, i.e. the decode operation doesn't distinguish between ä, ö, etc. I tried all possible de-/encoding arguments mentioned here: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.decode.html

Does anyone know the solution to this problem? Thanks!

Niels
    Does this solve your issue: [What is the best way to remove accents with Apache Spark dataframes in PySpark?](https://stackoverflow.com/questions/38359534/what-is-the-best-way-to-remove-accents-with-apache-spark-dataframes-in-pyspark) – kinshukdua Oct 01 '21 at 10:29
  • Solved it myself. You need to set the encoding option when reading in the dataset: `dataframe = spark.read.csv('path/data.csv', header=True, encoding="ISO-8859-1")`. Also, I set the magic comment `# coding=utf-8` in the first line. After that, you can just use ä, ü, etc. in the normal way and replace the characters. – Niels Oct 04 '21 at 08:54
  • `# coding=utf-8` declares the encoding of the *source file* only and has no effect on other encoding/decoding operations. It is also the default encoding of source files expected by Python 3, so is not needed if your source file is saved with UTF-8 encoding. – Mark Tolonen Oct 05 '21 at 15:50
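
For reference, a minimal sketch of the fix described in the comments above: read the CSV with the encoding the file was actually written in, then replace the umlauts with plain `regexp_replace` calls. The path `'path/data.csv'` is taken from the comment; the column name `column` and the replacement map are illustrative placeholders.

```python
# coding=utf-8
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read with the file's real encoding so ä, ö, ü arrive intact
# instead of as the replacement character (�).
df = spark.read.csv("path/data.csv", header=True, encoding="ISO-8859-1")

# Once decoded correctly, the characters can be replaced literally;
# "column" is a placeholder for the actual string column name.
replacements = {"ä": "ae", "ö": "oe", "ü": "ue",
                "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}
col = F.col("column")
for src, dst in replacements.items():
    col = F.regexp_replace(col, src, dst)

df = df.withColumn("column", col)
df.show()
```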

0 Answers