I have some arrays like this:

['دوره', 'ندارد', '\uf0d6', 'دارد']

I want to remove the specific Unicode characters like \uf0d6 from them. I have tried this:

  for item in tagArray[0]:
    if item.startswith('\u'):
      tagArray[0] = tagArray[0].remove(item)

but when I run it, I receive this error:

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape

How can I fix this?

Edit: Encoding to ASCII and decoding, or filtering out code points above 127, doesn't work: Persian characters aren't ASCII, so those approaches remove all of them.

BlueBlue
  • Does this answer your question? [Replace non-ASCII characters with a single space](https://stackoverflow.com/questions/20078816/replace-non-ascii-characters-with-a-single-space) – sahasrara62 Sep 06 '21 at 19:50
  • It's because Python expects 4 hex digits after the `\u`. Instead, try looking for Unicode characters in a range outside the ASCII range (code points above 127). – Frontear Sep 06 '21 at 19:56
  • @sahasrara62 this method doesn't work because Persian characters aren't ASCII, so it removes all of them – BlueBlue Sep 06 '21 at 20:05
  • @Frontear this method doesn't work because Persian characters aren't ASCII, so it removes all of them – BlueBlue Sep 06 '21 at 20:12
  • You can create an exception for all Persian unicode letters, which should exclude your strings. Consider a Unicode table for reference. – Frontear Sep 06 '21 at 20:22
  • '\uXXXX' is not a string of several characters: it contains neither `\` nor `u`. It is an escape sequence meaning *the character (code point) XXXX in the Unicode database*. When Python executes the code, it sees just one character. Note: it works like other escape sequences such as `\n`, which are typically used for characters that are unprintable or difficult to type, or when a specific code point is needed. – Giacomo Catenazzi Sep 07 '21 at 06:23
  • There really isn't enough information to answer this question well, but note that the character shown as an escape code is a private use character and has no assigned definition. You can use `unicodedata.category` to filter out character types you don't want. For example `Lo` or *Letter, other* is the category of the Persian code points, and `Co` (*private use*) is the category of U+F0D6. – Mark Tolonen Sep 07 '21 at 17:06
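
The `unicodedata.category` suggestion in the comments can be sketched roughly as follows. This is a minimal sketch, assuming the only unwanted characters are private-use ones (category `Co`); the helper name `strip_private_use` and the variable `tags` are illustrative and not from the original post.

```python
import unicodedata

def strip_private_use(text):
    """Drop characters whose Unicode general category is 'Co' (private use)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Co")

tags = ['دوره', 'ندارد', '\uf0d6', 'دارد']

# '\uf0d6' is a single character, so there is no '\u' prefix to test for.
print(len('\uf0d6'))

# Clean each string, then drop any that end up empty.
# Building a new list avoids mutating the list while iterating over it.
cleaned = [s for s in (strip_private_use(t) for t in tags) if s]
print(cleaned)
```

Other categories can be excluded the same way by extending the comparison, for example to a set of category codes.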

1 Answer

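You can remove every non-word character with a regular expression and then drop any strings that end up empty. Note that `[^\w]` also strips whitespace and punctuation, not only characters like \uf0d6.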

Code:

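import re

def remove_unicode(text):
    # Delete every run of non-word characters (anything \w does not match).
    # Persian letters match \w; the private-use character '\uf0d6' does not.
    return re.sub(r'[^\w]+', '', text, flags=re.U)

str_list = ['دوره', 'ندارد', '\uf0d6', 'دارد']
res_list = []

# Avoid naming the loop variable `str`, which would shadow the built-in type.
for item in str_list:
    res_str = remove_unicode(item)
    if res_str:  # skip strings that became empty
        res_list.append(res_str)

print(res_list)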

Output:

['دوره', 'ندارد', 'دارد']
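
If whitespace and punctuation need to survive, a narrower variant is to target only the private-use range. This is a minimal sketch, assuming all unwanted characters fall in the BMP Private Use Area (U+E000 to U+F8FF); `strip_pua` and the sample strings are illustrative and not part of the original answer.

```python
import re

# BMP Private Use Area: U+E000 through U+F8FF.
# The re module understands \uXXXX escapes inside str patterns.
PRIVATE_USE = re.compile(r'[\ue000-\uf8ff]')

def strip_pua(text):
    """Remove only private-use characters, leaving everything else intact."""
    return PRIVATE_USE.sub('', text)

# Illustrative input: a space and a '!' that the \w-based pattern would remove.
samples = ['دوره جدید', '\uf0d6', 'دارد!']

# Clean each string and drop the ones that become empty.
result = [s for s in map(strip_pua, samples) if s]
print(result)
```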
Sabil