How to remove special characters and Unicode emojis in PySpark?

Question

Good afternoon everyone. I'm having trouble cleaning special characters from a string column of a DataFrame. I just want to remove special characters such as HTML components, emojis, and stray Unicode escapes, for example \u2013.

Does anyone have a regular expression that could help, or any suggestions on how to handle this problem?

input:

i want to remove  and codes "\u2022"

expected output:

i want to remove and codes

I tried:

re.sub('[^A-Za-z0-9 \u2022]+', '', nome)

regexp_replace('nome', '\r\n|/[\x00-\x1F\x7F]/u', ' ')

df = df.withColumn( "value_2", F.regexp_replace(F.regexp_replace("value", "[^\x00-\x7F]+", ""), '""', '') )

df = df.withColumn("new",df.text.encode('ascii', errors='ignore').decode('ascii'))

I tried several solutions, but none of them recognizes the character "\u2013". Has anyone experienced this?

An easy way to do this is to encode and then decode your string as ASCII: `'i want to remove ☺ and codes "\u2022"'.encode('ascii', errors='ignore').decode('ascii')`. After that you only have to take care of double spaces or a leftover `""`, if you have them, but that's another simple string substitution. – Nullman, Nov 05 '21 at 23:48
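
The encode/decode idea from the comment above can be sketched in plain Python. `strip_non_ascii` is a hypothetical helper name, and the cleanup of leftover `""` and repeated spaces is an assumption about the desired output rather than something the comment spells out.

```python
import re


def strip_non_ascii(text: str) -> str:
    """Drop every non-ASCII character, then tidy what is left behind."""
    # errors="ignore" silently discards characters that ASCII cannot represent
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    # Assumption: empty quote pairs left by removed characters are unwanted
    without_empty_quotes = ascii_only.replace('""', "")
    # Collapse runs of whitespace and trim both ends
    return re.sub(r"\s+", " ", without_empty_quotes).strip()


cleaned = strip_non_ascii('i want to remove \u263a and codes "\u2022"')
print(cleaned)
```

A Spark Column is not a Python string, so `.encode` cannot be called on it directly. On a DataFrame, a function like this would have to run through a Python UDF (for example via `pyspark.sql.functions.udf`), which is typically slower than `regexp_replace`.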

blackbishop · Answer 1 · 2021-11-06T11:45:32.077

You can use this regex with the regexp_replace function to remove all non-ASCII characters from the column, then remove any empty pairs of double quotes that remain:

import pyspark.sql.functions as F

df = spark.createDataFrame([('i want to remove  and codes "\u2022"',)], ["value"])

df = df.withColumn(
    "value_2",
    F.regexp_replace(F.regexp_replace("value", "[^\x00-\x7F]+", ""), '""', '')
)

df.show(truncate=False)

#+---------------------------------+----------------------------+
#|value                            |value_2                     |
#+---------------------------------+----------------------------+
#|i want to remove  and codes "•"|i want to remove  and codes |
#+---------------------------------+----------------------------+
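
The output above still contains a double space where the emoji used to be. Below is a minimal sketch of the same two replacements plus a whitespace cleanup, written with Python's `re` module so the patterns can be checked locally. Spark evaluates `regexp_replace` with Java regular expressions, and these particular patterns are assumed to behave the same in both engines. The third step is an addition, not part of the answer.

```python
import re

value = 'i want to remove \u263a and codes "\u2022"'

# Step 1: remove every run of non-ASCII characters (same class as the answer)
step1 = re.sub(r"[^\x00-\x7F]+", "", value)
# Step 2: remove the empty pair of double quotes left behind
step2 = re.sub(r'""', "", step1)
# Step 3 (addition): collapse repeated spaces and trim both ends
value_2 = re.sub(r" {2,}", " ", step2).strip()

print(repr(value_2))
```

In Spark, the third step would correspond to one more `F.regexp_replace` with the pattern `" {2,}"`, wrapped in `F.trim`.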

jeff pentagon · Answer 2 · 2021-11-05T23:32:31.157

import re
L=re.findall(r"[^•]+", "abdasfrasdfadfsadfaa•sdf•adsfasfasfasf")
print(L) # prints ['abdasfrasdfadfsadfaa', 'sdf', 'adsfasfasfasf']

So, to remove the smiley and the bullet character (\u2022), put the characters you want to drop inside the negated class, call findall with that pattern, and then join the returned list, like below:

import re
given_string = "Your input• string "
result_string = "".join(re.findall(r"[^•]+", given_string))
print(result_string) #prints 'Your input string '

If you know the Unicode code point of an emoji, you can write it as an escape in the pattern instead of the character itself, like below:

result_string = "".join(re.findall(r"[^\u2022]+", given_string))
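
A more direct variant of the same idea is to delete the unwanted characters with `re.sub` instead of collecting everything else with `findall` and joining. The sketch below assumes the smiley is U+263A. Adjust the character class to the code points actually present in your data.

```python
import re

# Characters to delete: U+263A (smiley, assumed) and U+2022 (bullet)
UNWANTED = re.compile(r"[\u263a\u2022]")

given_string = "Your input\u2022 string \u263a"
result_string = UNWANTED.sub("", given_string)

print(repr(result_string))
```

Listing the characters to remove, rather than the characters to keep, leaves other non-ASCII text such as accented letters untouched.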

edited Nov 05 '21 at 23:32

answered Nov 05 '21 at 23:24

I managed to solve the problem as follows: `result_string = re.sub('"\\\\u2022\\"', " ", given_string)`. Thanks to everyone for the suggested solutions. – Carlos Eduardo Bilar Rodrigues Nov 09 '21 at 11:30
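
The fix reported in the comment above suggests that the column held the six literal characters `\u2022` (a backslash, `u`, and four hex digits) rather than the bullet character itself. That would explain why patterns targeting non-ASCII characters found nothing to remove. Here is a sketch under that assumption, generalised to any `\uXXXX` escape. `remove_literal_escapes` is a hypothetical helper name.

```python
import re

# Optional quote, a literal backslash + "u" + four hex digits, optional quote
LITERAL_ESCAPE = re.compile(r'"?\\u[0-9a-fA-F]{4}"?')


def remove_literal_escapes(text: str) -> str:
    """Remove literal \\uXXXX sequences (plain ASCII text, not real characters)."""
    without_escapes = LITERAL_ESCAPE.sub("", text)
    # Collapse whitespace left behind by the removal and trim both ends
    return re.sub(r"\s+", " ", without_escapes).strip()


# A raw string keeps the backslash, so this holds the text \u2022, not a bullet
sample = r'i want to remove and codes "\u2022"'
print(remove_literal_escapes(sample))
```

If the data contains both real non-ASCII characters and literal escapes, this step can be combined with the non-ASCII removal shown in the first answer.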
