I have a string like this:
bat★☆ ⛱ ✨♂️⛷❤️֎۩ᴥ★ Lôa Créole♥
I need to remove all the emoji symbols from it, but keep accented letters such as ô and é. I found this approach online:
regexp_replace(df("word"), """[^ 'a-zA-Z0-9,.?!]""","")
But this pattern also removes ô and é, because they are not in the a-zA-Z range. How can I keep ô and é and strip only the emoji symbols? Here is what I get:
scala> val df = Seq(
| (8, "bat★☆ ⛱ ✨♂⛷❤֎۩ᴥ★ Lôa Créole♥"),
| (64, "bb")
| ).toDF("number", "word")
df: org.apache.spark.sql.DataFrame = [number: int, word: string]
scala> df.select($"number", $"word", regexp_replace(df("word"), """[^ 'a-zA-Z0-9,.?!]""","").alias("word_revised")).show(false)
+------+------------------------------------------------+---------------+
|number|word |word_revised |
+------+------------------------------------------------+---------------+
|8 |bat★☆ ⛱ ✨♂️⛷❤️֎۩ᴥ★ Lôa Créole♥|bat La Crole|
|64 |bb |bb |
+------+------------------------------------------------+---------------+
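One direction I am considering, as a rough sketch only: extend the whitelist with the accented Latin-1 letters (À-Ö, Ø-ö, ø-ÿ, which skips the × and ÷ signs). This assumes the accented letters are stored as single precomposed characters, which seems to be the case here since ô disappears entirely instead of leaving a plain o. The names keepPattern and stripSymbols are made up for illustration, and the sample string is a shortened version of mine.

```scala
// Whitelist: space, apostrophe, ASCII letters and digits, basic punctuation,
// plus the accented letters of the Latin-1 Supplement block.
// Assumption: accented letters are precomposed (NFC), e.g. ô is U+00F4.
val keepPattern = "[^ 'a-zA-Z0-9À-ÖØ-öø-ÿ,.?!]"

// Hypothetical helper: removes every character that is not whitelisted.
def stripSymbols(s: String): String = s.replaceAll(keepPattern, "")

println(stripSymbols("bat★ Lôa Créole♥"))  // prints "bat Lôa Créole"

// In spark-shell the same pattern should work with regexp_replace,
// since Spark evaluates it with Java regular expressions:
//   df.select($"number", $"word",
//     regexp_replace(df("word"), keepPattern, "").alias("word_revised")).show(false)
```

This only covers Western European accents. Letters outside that range (for example ő or ş) would still be removed, so I am not sure it is the right general solution.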