Remove a specific Unicode character from an array in Python

Question

I have some arrays like this:

['دوره', 'ندارد', '\uf0d6', 'دارد']

I want to remove specific Unicode characters such as \uf0d6 from them. I have tried this:

  for item in tagArray[0]:
    if item.startswith('\u'):
      tagArray[0] = tagArray[0].remove(item)

but when I run it, I receive this error:

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape

How can I fix this?

Edit: Approaches that encode/decode to ASCII, or that filter on code points below 128, don't work: Persian characters aren't ASCII, so all of them get removed.

Does this answer your question? [Replace non-ASCII characters with a single space](https://stackoverflow.com/questions/20078816/replace-non-ascii-characters-with-a-single-space) — sahasrara62, Sep 06 '21 at 19:50
It's because Python expects four hex digits after the `\u`. Instead, try looking for Unicode characters outside the ASCII range (code points above 127). — Frontear, Sep 06 '21 at 19:56
@sahasrara62 this method doesn't work because Persian characters aren't ASCII, so it removes all of them — BlueBlue, Sep 06 '21 at 20:05
@Frontear this method doesn't work because Persian characters aren't ASCII, so it removes all of them — BlueBlue, Sep 06 '21 at 20:12
You can create an exception for all Persian unicode letters, which should exclude your strings. Consider a Unicode table for reference. — Frontear, Sep 06 '21 at 20:22
'\uXXXX' is not a string of several characters: it contains neither `\` nor `u`. It is an escape sequence meaning *the character (code point) XXXX in the Unicode database*. When Python executes the code, it sees just one character. Note: this is much like other escape sequences such as `\n`, which are typically used for characters that are unprintable or hard to type, or when a specific code point is needed. — Giacomo Catenazzi, Sep 07 '21 at 06:23
There really isn't enough information to answer this question well, but note that the character shown as an escape code is a private use character and has no assigned definition. You can use `unicodedata.category` to filter out character types you don't want. For example `Lo` or *Letter, other* is the category of the Persian code points, and `Co` (*private use*) is the category of U+F0D6. — Mark Tolonen, Sep 07 '21 at 17:06
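To illustrate the points made in the comments: `'\u'` on its own is an incomplete escape (Python needs exactly four hex digits after `\u`), which is what triggers the SyntaxError, and `'\uf0d6'` is a single character, so there is no backslash for `startswith` to find. Also, `list.remove` returns `None`, so assigning its result back would lose the list. A minimal sketch, assuming the goal is to drop that one known code point (the names `UNWANTED`, `tags`, and `cleaned` are illustrative, not from the question):

```python
# '\uf0d6' is one character (code point U+F0D6), not a backslash followed by text.
UNWANTED = '\uf0d6'  # hypothetical constant name

tags = ['دوره', 'ندارد', '\uf0d6', 'دارد']

# Build a new list instead of removing items while iterating over the list.
# First strip the unwanted character from every item...
stripped = [t.replace(UNWANTED, '') for t in tags]
# ...then drop any items that are now empty.
cleaned = [t for t in stripped if t]

print(len(UNWANTED))  # length of the escape sequence as Python sees it
print(cleaned)
```

This only handles the one character named in `UNWANTED`; if several such characters can occur, a category-based filter may be a better fit.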

score 1 · Accepted Answer · answered Sep 06 '21 at 20:21

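You can try the following snippet. It deletes every character that `\w` does not match. In Python 3, `\w` matches Unicode letters, digits, and the underscore, so the Persian words are kept, while U+F0D6 (a private-use character) is removed. Strings that end up empty are dropped. Note that this also strips whitespace and punctuation.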

Code:

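import re

def remove_unicode(text):
    # \W matches anything that is not a Unicode letter, digit, or underscore.
    # Python 3 str patterns are Unicode-aware by default, so no flag is needed.
    return re.sub(r'\W+', '', text)

str_list = ['دوره', 'ندارد', '\uf0d6', 'دارد']
res_list = []

# Avoid naming a variable `str`: it would shadow the built-in type.
for item in str_list:
    res_str = remove_unicode(item)
    if res_str:  # skip strings that became empty
        res_list.append(res_str)

print(res_list)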

Output:

['دوره', 'ندارد', 'دارد']
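An alternative that follows the `unicodedata.category` suggestion from the comments: rather than removing every non-word character, filter only on the category of U+F0D6, which is `Co` (private use). This is a sketch under the assumption that all unwanted characters are private-use code points; the helper name `strip_private_use` is made up for illustration.

```python
import unicodedata

def strip_private_use(text):
    """Return text without private-use (category 'Co') characters."""
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Co')

str_list = ['دوره', 'ندارد', '\uf0d6', 'دارد']

# Clean each string, then keep only the ones that are not empty.
res_list = [s for s in map(strip_private_use, str_list) if s]

print(res_list)
```

Unlike the regex version, this leaves spaces and punctuation inside the strings untouched.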

Sabil


Would this work if the unicode character was hard coded in, rather than with `\u`? – Frontear Sep 06 '21 at 20:24
I can't guarantee it, but hopefully it will work. Try it and let me know if it doesn't. – Sabil Sep 06 '21 at 20:26
This answer lacks an explanation, so it is not very useful. We like to understand problems (so that we can solve not just one problem, but an entire class of problems). – Giacomo Catenazzi Sep 07 '21 at 06:19
