You have two problems here.
First, the whole point of raw string literals is that they don't treat backslash escapes as backslash escapes. So, `ur"[\u2122\u00ae]"` is literally the characters `[`, `\`, `u`, `2`, `1`, etc.
In Python 3, that's fine, because the `re` module understands `\u` escapes as meaning Unicode characters, so the pattern ends up being the character class with U+2122 and U+00AE in it, exactly as you want. But in Python 2, it doesn't, so the character class ends up being a mess of useless junk.
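A minimal sketch of that Python 3 behavior, for illustration only. It assumes Python 3.3 or later, where the prefix is spelled plain `r` (Python 3 does not accept `ur`), and the sample text is made up:

```python
import re

# Raw literal: the backslashes survive, so this is 14 characters,
# not a 4-character string containing U+2122 and U+00AE.
pattern = r"[\u2122\u00ae]"
pattern_length = len(pattern)

# The re module itself interprets the \u escapes (Python 3.3+),
# so the class still matches the trademark and registered signs.
cleaned = re.sub(pattern, "", "a\u2122b\u00aec")
print(pattern_length, cleaned)
```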
If you change it to use a non-raw string literal, that will solve that problem: `u"[\u2122\u00ae]"`. Of course, that will bring up all the other potential problems that make people want to use raw string literals with regular expressions in the first place, but fortunately, you don't have any of them here.
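As a rough illustration of the non-raw version: the literal below already contains the two real characters before `re` ever sees it, so nothing depends on the regex engine understanding `\u`. The sample text is an assumption; the `u` prefix is accepted by Python 2 and by Python 3.3+.

```python
import re

# Non-raw literal: the parser turns each \u escape into one character,
# so the pattern is just "[", U+2122, U+00AE, "]".
pattern = u"[\u2122\u00ae]"
pattern_chars = [hex(ord(ch)) for ch in pattern]

cleaned = re.sub(pattern, u"", u"a\u2122b\u00aec")
print(pattern_chars, cleaned)
```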
The second problem is that you're using Unicode characters in Unicode literals without an encoding declaration. Again, not a problem in Python 3, but it is in Python 2.
When you type `"a™b®c"`, there's a good chance that you're actually giving Python not a `\u2122` character, but a `\u0099` character. Your console is probably in something like cp1252, so when you type or paste a `™`, what it actually gives Python is U+0099, not U+2122. Of course your console also displays things incorrectly, so that U+0099 ends up looking like a `™`. But Python doesn't have any idea what's going on. It just sees that U+0099 is not the same character as U+2122, and therefore there's no match. (Your first example works because your search string also has the incorrect `\u0099`, so it happens to match.)
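The mismatch can be reproduced without a misconfigured console. This sketch (written for Python 3, with made-up sample bytes) takes the bytes a cp1252 console would send for a™b®c and decodes them two ways. Decoding as Latin-1 stands in for the misinterpretation described above; that is an assumption about how the U+0099 arises, not a trace of what Python 2 does internally.

```python
import re

# cp1252 bytes for: a, trademark sign, b, registered sign, c
console_bytes = b"a\x99b\xaec"

misread = console_bytes.decode("latin-1")   # 0x99 becomes U+0099
intended = console_bytes.decode("cp1252")   # 0x99 becomes U+2122

pattern = u"[\u2122\u00ae]"
misread_result = re.sub(pattern, u"", misread)    # U+0099 survives
intended_result = re.sub(pattern, u"", intended)  # both signs removed
print(repr(misread_result), repr(intended_result))
```

Note that the registered sign is byte 0xAE in both encodings, which is why only the trademark sign goes wrong.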
In source code, you could fix this, either by adding an encoding declaration to tell Python that you're using cp1252, or by telling your editor to use UTF-8 instead of cp1252 in the first place. But at the interactive interpreter, you get whatever encoding your console wants, and there's nowhere to put an encoding declaration.
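For the source-file route, a sketch of what such a file might look like, assuming the editor really saves it as UTF-8. The declaration has to match the bytes on disk, and it must sit on the first or second line of the file:

```python
# -*- coding: utf-8 -*-
import re

# With the declaration matching the file's real encoding, the literal
# characters below are decoded as U+2122 and U+00AE.
text = u"a™b®c"
cleaned = re.sub(u"[\u2122\u00ae]", u"", text)
print(cleaned)
```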
Really, there's no good solution to this.
Well, there is: upgrade to Python 3. The main reason it exists in the first place is to make Unicode headaches like this go away, and Python 2 is less than a year and a half from end of life, so do you really want to learn how to deal with Unicode headaches in Python 2 today?
You could also get a UTF-8 terminal (and one that Python recognizes as such). That's automatic on macOS or most recent Linux distros; on Windows, it's a lot harder, and probably not the way you want to go here.
So, the only alternative is to just never use Unicode characters in Unicode literals on the interactive interpreter. Again, you can use them in source code, but interactively, you have to either:
- Use backslash escapes.
- Use non-Unicode literals and carefully decode them everywhere.
I'm not sure whether `"a™b®c".decode('cp1252')` is really better than `\u` escapes, but it will work.
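The two options can be compared side by side. This sketch is written for Python 3, where the byte string has to be spelled with a `b` prefix and explicit byte escapes; in Python 2, the plain "a™b®c" literal typed at a cp1252 console would play that role. The sample text is an assumption.

```python
import re

# Option 1: backslash escapes in a Unicode literal.
escaped = u"a\u2122b\u00aec"

# Option 2: a byte string, decoded explicitly with the console's encoding.
decoded = b"a\x99b\xaec".decode("cp1252")

same = escaped == decoded
cleaned = re.sub(u"[\u2122\u00ae]", u"", decoded)
print(same, cleaned)
```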