Good afternoon everyone. I'm having trouble cleaning special characters out of a string column in a dataframe. I want to remove things like HTML components, emojis, and Unicode errors, for example \u2013.
Does anyone have a regular expression that would help, or any suggestions on how to handle this problem?
Input:
i want to remove and codes "\u2022"
Expected output:
i want to remove and codes
I tried:
re.sub('[^A-Za-z0-9 \u2022]+', '', nome)
regexp_replace('nome', '\r\n|/[\x00-\x1F\x7F]/u', ' ')
df = df.withColumn( "value_2", F.regexp_replace(F.regexp_replace("value", "[^\x00-\x7F]+", ""), '""', '') )
df = df.withColumn("new",df.text.encode('ascii', errors='ignore').decode('ascii'))
I have tried several solutions, but none of them recognizes the character "\u2013". Has anyone experienced this?
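
One thing that might be worth checking, offered only as a sketch: if the column holds the literal six-character text \u2013 (a backslash, the letter u, and four hex digits) rather than the real en dash character, then patterns such as [^\x00-\x7F] would never match it, because every one of those six characters is plain ASCII. The plain-Python function below assumes that situation and also handles the real characters. The name clean_text, the rough HTML tag pattern, and the handling of leftover empty quotes are all illustrative assumptions, not a tested solution for this dataframe.

```python
import html
import re

# Literal escape text: a backslash, "u", and four hex digits (e.g. the text \u2022).
ESCAPE_TEXT = re.compile(r'\\u[0-9a-fA-F]{4}')
# Rough HTML tag matcher (an assumption; this is not a full HTML parser).
HTML_TAG = re.compile(r'<[^>]+>')
# Real non-ASCII characters: emojis, actual en dashes, bullets, accented letters.
NON_ASCII = re.compile(r'[^\x00-\x7F]+')
# Quote pairs left empty once their content has been removed.
EMPTY_QUOTES = re.compile(r'"\s*"')


def clean_text(value):
    """Hypothetical helper: strip HTML, escape text, and non-ASCII characters."""
    if value is None:
        return None
    text = HTML_TAG.sub(' ', value)      # drop tags such as <b> or </p>
    text = html.unescape(text)           # turn entities like &amp; into characters
    text = ESCAPE_TEXT.sub(' ', text)    # drop literal \uXXXX sequences
    text = NON_ASCII.sub(' ', text)      # drop real non-ASCII characters
    text = EMPTY_QUOTES.sub(' ', text)   # drop "" or " " left behind
    return re.sub(r'\s+', ' ', text).strip()


# Literal escape text, as it might be stored in the column.
print(clean_text('i want to remove and codes "\\u2022"'))
# -> i want to remove and codes

# The real bullet character gives the same result.
print(clean_text('i want to remove and codes "\u2022"'))
# -> i want to remove and codes
```

In PySpark this could presumably be applied by wrapping the function as a UDF, for example F.udf(clean_text, StringType()), and passing it to withColumn. Note that the non-ASCII step also removes accented letters, which may or may not be acceptable for the data in question.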