I have a PySpark dataframe with a string column that originally contains characters such as 'Ä', 'ö', 'ü', etc. I would like to replace these characters with 'Ae', 'oe', 'ue', etc. A direct regexp_replace from ü to ue does not work, because the characters do not arrive intact in the dataframe. When I run
df.show()
PySpark displays each of these characters as �. After some searching, I found that when I decode this column with
decode(df.column, 'ISO-8859-1') *or* decode(df.column, 'ascii')
I can get rid of the �; however, the result is always ¿½k, i.e. the decoding does not distinguish between ä, ö, etc. I tried all the charset arguments listed at https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.decode.html
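The ¿½ pattern can be reproduced in plain Python, which may help narrow down where things go wrong. This is only a sketch under an assumption the post does not confirm: that the source bytes are ISO-8859-1 but were read as UTF-8 when the dataframe was created. The helper name `simulate` and the sample words are made up for illustration.

```python
# Assumption (not confirmed above): the source bytes are ISO-8859-1,
# but they were read as UTF-8 when the dataframe was created.

def simulate(word):
    """Return (misread, late_fix, early_fix) for one sample word."""
    raw = word.encode("iso-8859-1")                  # bytes as stored in the source
    misread = raw.decode("utf-8", errors="replace")  # invalid bytes become U+FFFD
    # Decoding the already-damaged string afterwards: U+FFFD is the UTF-8 byte
    # sequence EF BF BD, which reads as three Latin-1 characters.
    late_fix = misread.encode("utf-8").decode("iso-8859-1")
    # Decoding the original bytes with the right charset keeps the umlaut.
    early_fix = raw.decode("iso-8859-1")
    return misread, late_fix, early_fix

for sample in ["K\u00e4se", "\u00d6l", "f\u00fcr"]:  # Käse, Öl, für
    print([ascii(part) for part in simulate(sample)])
```

If that assumption holds, every umlaut has already collapsed into the same replacement character before decode is called, so no charset argument can tell ä from ö afterwards. The charset would have to be set where the data is read instead.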
Does anyone know how to solve this problem? Thanks!
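To make the goal concrete, here is a minimal sketch of the mapping I am after, written in plain Python rather than PySpark. The names `UMLAUT_MAP` and `transliterate` are hypothetical, and it only works once the strings actually contain the umlauts.

```python
import unicodedata

# Hypothetical mapping table; keys are written as escapes so the code does
# not depend on the encoding of this post.
UMLAUT_MAP = str.maketrans({
    "\u00c4": "Ae",  # Ä
    "\u00d6": "Oe",  # Ö
    "\u00dc": "Ue",  # Ü
    "\u00e4": "ae",  # ä
    "\u00f6": "oe",  # ö
    "\u00fc": "ue",  # ü
})

def transliterate(text):
    """Replace German umlauts with their two-letter ASCII spellings."""
    # Normalize first so that 'a' + combining diaeresis is treated like 'ä'.
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(UMLAUT_MAP)

print(transliterate("\u00c4pfel f\u00fcr K\u00f6ln"))
```

In PySpark the same mapping could presumably be applied with chained regexp_replace calls or a UDF, but only after the column holds correctly decoded text.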