
I have a question. I have a dataframe with a string variable that originally includes 'Ä', 'ö', 'ü', etc. I would like to replace these characters with Ae, oe, etc. A direct `regexp_replace` from ü to ue does not work, of course. When I do

`df.show()`

PySpark shows the respective characters everywhere as �. I googled a bit, and when I try to decode this column with

`decode(df.column, 'ISO-8859-1')` *or* `decode(df.column, 'ascii')`,

I can get rid of �; however, the return is always ¿½k, i.e. the decode operation doesn't distinguish between ä, ö, etc. I tried all possible de-/encoding arguments mentioned here: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.decode.html

Does anyone know the solution to this problem? Thanks!

Niels
    Does this solve your issue: [What is the best way to remove accents with Apache Spark dataframes in PySpark?](https://stackoverflow.com/questions/38359534/what-is-the-best-way-to-remove-accents-with-apache-spark-dataframes-in-pyspark) – kinshukdua Oct 01 '21 at 10:29
  • Solved it myself. You need to set the encoding option when reading in the dataset: `dataframe = spark.read.csv('path/data.csv', header=True, encoding="ISO-8859-1")`. Also, I set the magic comment `# coding=utf-8` in the first line. After that, you can just use ä, ü, etc. in the normal way and replace the characters. – Niels Oct 04 '21 at 08:54
  • `# coding=utf-8` declares the encoding of the *source file* only and has no effect on other encoding/decoding operations. It is also the default encoding of source files expected by Python 3, so is not needed if your source file is saved with UTF-8 encoding. – Mark Tolonen Oct 05 '21 at 15:50
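
For reference, a minimal sketch of the fix described in the comments above: read the CSV with the encoding the file was actually written in, then replace the umlauts with plain `regexp_replace` calls. The path `'path/data.csv'` is taken from the comment; the column name `column` and the replacement map are illustrative placeholders.

```python
# coding=utf-8
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read with the file's real encoding so ä, ö, ü arrive intact
# instead of as the replacement character (�).
df = spark.read.csv("path/data.csv", header=True, encoding="ISO-8859-1")

# Once decoded correctly, the characters can be replaced literally;
# "column" is a placeholder for the actual string column name.
replacements = {"ä": "ae", "ö": "oe", "ü": "ue",
                "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}
col = F.col("column")
for src, dst in replacements.items():
    col = F.regexp_replace(col, src, dst)

df = df.withColumn("column", col)
df.show()
```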

0 Answers