
I have a Spark DataFrame with a string column containing special characters such as áãâàéêèíîìóõôòúûùç, and I want to replace them with aaaaeeeiiioooouuuc, respectively.

As an example of what I want:

name     | unaccent
Vitória  | Vitoria
João     | Joao
Maurício | Mauricio

I found this example, but it doesn't work for these special characters: Pyspark removing multiple characters in a dataframe column.

I've tried to create this df manually, but for some reason I couldn't replicate the special characters and a question mark (?) shows up instead:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

df = spark.createDataFrame(
    data=[("Vitória",), ("João",), ("Maurício",)],
    schema=StructType([StructField("A", StringType(), True)]),
)
df.show()

+--------+
|       A|
+--------+
| Vit?ria|
|    Jo?o|
|Maur?cio|
+--------+

When I use the translate function, this is the result:

df.select("A",
          F.translate(F.col("A"), "áãâàéêèíîìóõôòúûùç", "aaaaeeeiiioooouuuc").alias("unaccent")).show()

+--------+--------+
|       A|unaccent|
+--------+--------+
| Vit?ria| Vitaria|
|    Jo?o|    Joao|
|Maur?cio|Mauracio|
+--------+--------+

Any thoughts on how to unaccent these special characters?

Tiago Shimizu

1 Answer


It seems like the problem is in your IDE, not in PySpark.
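
If the accented literals are being corrupted on the way in, one way to rule out the encoding (a sketch of a workaround, not something your setup requires, assuming Python 3) is to write them as Unicode escape sequences, which are plain ASCII and survive any source-file or terminal encoding:

from pyspark.sql.types import StructType, StructField, StringType

# "\u00f3" is ó, "\u00e3" is ã, "\u00ed" is í. The escapes themselves are
# ASCII, so no file or console encoding can turn the data into "?".
df = spark.createDataFrame(
    data=[("Vit\u00f3ria",), ("Jo\u00e3o",), ("Maur\u00edcio",)],
    schema=StructType([StructField("A", StringType(), True)]),
)

The stored data is then correct even if show() still renders "?" in a misconfigured console.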

My environment: Jupyter notebook in VS Code (macOS):

df.withColumn(
    "unaccent", 
    F.translate(F.col("A"), "áãâàéêèíîìóõôòúûùç", "aaaaeeeiiioooouuuc")
).show()

results in the correct output:

+--------+--------+
|       A|unaccent|
+--------+--------+
| Vitória| Vitoria|
|    João|    Joao|
|Maurício|Mauricio|
+--------+--------+

(spark.version = 3.2.1)
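
If you would rather not enumerate every accented character by hand, a more general alternative (my addition, not required by the question; it trades the speed of translate for the generality of a Python UDF) is to strip combining marks after Unicode NFD normalization:

import unicodedata

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def unaccent(s):
    if s is None:
        return None
    # NFD decomposes "ó" into "o" plus a combining acute accent (and "ç"
    # into "c" plus a combining cedilla); dropping category-Mn characters
    # removes those marks.
    nfd = unicodedata.normalize("NFD", s)
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")

df.withColumn("unaccent", unaccent(F.col("A"))).show()

This handles any accented character, not just the ones listed in the translate call.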

Alexander Volok