I have a Spark DataFrame with a string column containing special characters such as áãâàéêèíîìóõôòúûùç, and I want to replace them with aaaaeeeiiioooouuuc, respectively.
As an example of what I want:
name     | unaccent
Vitória  | Vitoria
João     | Joao
Maurício | Mauricio
I found this example, but it doesn't work for these special characters: Pyspark removing multiple characters in a dataframe column
I've tried to create this df manually, but for some reason I couldn't replicate the special characters, and a question mark ? shows up instead:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

df = spark.createDataFrame(
    data=[("Vitória",), ("João",), ("Maurício",)],
    schema=StructType([StructField("A", StringType(), True)]),
)
df.show()
+--------+
| A|
+--------+
| Vit?ria|
| Jo?o|
|Maur?cio|
+--------+
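I suspect the ? comes from the encoding of my script or console rather than from Spark itself. Building the same rows from explicit Unicode escapes (which the file's encoding can't mangle) seems like it should avoid the problem; a minimal sketch of that idea ("\u00f3" is ó, "\u00e3" is ã, "\u00ed" is í):

# Build the rows from Unicode escapes so the source-file encoding
# cannot mangle the accented characters.
data = [("Vit\u00f3ria",), ("Jo\u00e3o",), ("Maur\u00edcio",)]
df = spark.createDataFrame(data, schema=StructType([StructField("A", StringType(), True)]))
df.show()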
When I use the translate function, this is the result:
df.select("A",
F.translate(F.col("A"), "áãâàéêèíîìóõôòúûùç", "aaaaeeeiiioooouuuc").alias("unaccent")).show()
+--------+--------+
| A|unaccent|
+--------+--------+
| Vit?ria| Vitaria|
| Jo?o| Joao|
|Maur?cio|Mauracio|
+--------+--------+
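I also wondered whether stripping the accents with Python's unicodedata module in a UDF would be more robust than listing every character by hand, since NFKD normalization splits an accented letter into its base letter plus a combining mark that can then be dropped. A minimal sketch of that idea (untested in my environment):

import unicodedata
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def strip_accents(s):
    # Decompose each accented character (NFKD), then drop the combining
    # marks, so "Vitória" becomes "Vitoria" and "ç" becomes "c".
    if s is None:
        return None
    return "".join(
        c for c in unicodedata.normalize("NFKD", s)
        if not unicodedata.combining(c)
    )

strip_accents_udf = F.udf(strip_accents, StringType())
df.select("A", strip_accents_udf("A").alias("unaccent")).show()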
Any thoughts on how to unaccent these special characters?