I am trying to identify and replace Unicode escape sequences in strings that I am processing to build keyword match filters.
For example, given the string
"Apple iPhone 12 mini A2176 128GB\u00a0(PRODUCT) Red!\u00a0Perfect condition! Unlocked!"
I want re.sub (replacing each match with a single space " ") to produce
"Apple iPhone 12 mini A2176 128GB (PRODUCT) Red! Perfect condition! Unlocked!"
So I went to a regex build-and-test website and came up with this pattern:
\\u[a-z|0-9]{4}
which matches the two Unicode escape sequences
\u00a0 and \u00a0
To apply it in my Python code, I first tried the snippet below. I use re.findall
to check whether the pattern returns the Unicode escape sequences:
import re
strin = "Apple iPhone 12 mini A2176 128GB\u00a0(PRODUCT) Red!\u00a0Perfect condition! Unlocked!"
print(re.findall('\\u[a-z|0-9]{4}', strin))
which raises the following error:
re.error: incomplete escape \u at position 0
I then tried adding an 'r' prefix to the pattern string. No error is raised, but no match is returned:
print(re.findall(r'\\u[a-z|0-9]{4}', strin))
The output is an empty list: []
I then tried the same two approaches with only one backslash:
print(re.findall('\u[a-z|0-9]{4}', strin))
gives
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
and
print(re.findall(r'\u[a-z|0-9]{4}', strin))
gives
re.error: incomplete escape \u at position 0
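For reference, here is a minimal sketch of what seems to be going on, under the assumption that the string is written as a normal (non-raw) Python literal as in the snippets above. In that case Python has already decoded \u00a0 into the single character U+00A0 (no-break space), so the string contains no backslash for the pattern to find. The names cleaned, raw and raw_cleaned are hypothetical and only used for illustration, and the second pattern uses a hex-digit class instead of the [a-z|0-9] class from the attempts above.

```python
import re

# In a normal string literal, \u00a0 is decoded when the source is compiled,
# so strin holds the single character U+00A0, not a backslash followed by "u".
strin = "Apple iPhone 12 mini A2176 128GB\u00a0(PRODUCT) Red!\u00a0Perfect condition! Unlocked!"
print("\\" in strin)  # False: there is no literal backslash in the string

# Matching the decoded character directly finds both occurrences.
print(re.findall("\u00a0", strin))  # ['\xa0', '\xa0']

# Replace each no-break space with a regular space.
cleaned = re.sub("\u00a0", " ", strin)
print(cleaned)

# Assumption: if the text instead holds literal backslash-u sequences
# (simulated here with a raw string), a raw pattern with an escaped
# backslash and four hex digits matches them.
raw = r"128GB\u00a0(PRODUCT) Red!"
raw_cleaned = re.sub(r"\\u[0-9a-fA-F]{4}", " ", raw)
print(raw_cleaned)
```

If only this one character matters, strin.replace("\u00a0", " ") would likely do the same job without a regular expression.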