Why can't Python 2's re module identify the u'®' character?

Question

I have a string that I want to run re.sub on in Python 2. I tried the following statement, and it worked:

>>> import re
>>> re.sub(u"[™®]", "", u"a™b®c")
'abc'

But when I tried the following statement, it failed on Windows 10 (Python 2.7.15 |Anaconda, Inc.| (default, May 1 2018, 18:37:09) [MSC v.1500 64 bit (AMD64)] on win32):

>>> re.sub(ur"[\u2122\u00ae]", "", u"a™b®c")
u'a?b?c'

I've also tried the solution from "Python and regular expression with Unicode", but it didn't work either.

>>> myre = re.compile(ur'[\u2122\u00ae]', re.UNICODE)
>>> print myre.sub('', u"a™b®c")

So why does this happen, and how can I fix it?

This isn't your issue, but you really shouldn't be trying to sub an 8-bit string `""` into a Unicode string `u"a™b®c"`. In order to do that, Python has to guess whether you want to encode one or decode the other, and, even though it happens to guess right, you're still relying on something non-obvious, and making your code a bit slower, for no good reason. — abarnert, Jul 29 '18 at 03:14
This works fine on my linux machine with python 2.7.14. I can't reproduce your bug. — Håken Lid, Jul 29 '18 at 03:23
@HåkenLid yes, this code works perfect on Ubuntu, what I mean by Linux in my question is another distribution — calvin, Jul 29 '18 at 03:26
Edit the question and add all relevant information about the platform and python version. — Håken Lid, Jul 29 '18 at 03:27
Is there a reason you need to use Python 2? Because dealing with the two problems you have here in Python 2 is a huge pain, while in Python 3 they don't even come up in the first place—and, in fact, fixing that is the whole reason they made a breaking change to the language 9 years ago. — abarnert, Jul 29 '18 at 03:35
@abarnert Yes, you are right, Python 3 will ease the pain dealing with encoding problems. However Python 2 is still used, e.g. in the project I'm currently working on. Though I can figure out some workarounds, I still wonder if there are some better solutions with Python 2. — calvin, Jul 31 '18 at 05:11
There really aren't better solutions with Python 2. You can be careful to always use `unicode` values (encoding and decoding as close to the edge as possible), maybe use PEP 484 type hints in comment form plus Mypy to make sure you don't screw up and use `str`, never use Unicode characters in literals, etc., but it's still going to be a pain. If there were better solutions than that in Python 2, Python 3 wouldn't exist. — abarnert, Jul 31 '18 at 05:15
Meanwhile, if the project you're working on has no plans to upgrade to Python 3, you should keep in mind that there's less than a year and a half until Python 2 hits end-of-life; Ubuntu, Red Hat, Anaconda, etc. are only giving minimal support beyond that; many libraries have already relegated 2.x to only "legacy" support… it's not going to get easier from here, it's going to get harder. — abarnert, Jul 31 '18 at 05:20

abarnert · Accepted Answer · 2018-07-31T05:21:19.277

You have two problems here.

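First, the whole point of raw string literals is that they don't treat backslash escapes as backslash escapes. Python 2's ur prefix is a special case, though: \uXXXX and \UXXXXXXXX escapes are still processed in a raw Unicode literal, and only the other backslashes are left alone. So ur"[\u2122\u00ae]" is not literally the characters [, \, u, 2, 1, etc.; it is the same four-character pattern as u"[\u2122\u00ae]", which is why dropping the r changes nothing for you.

It's still a habit worth dropping. The ur prefix doesn't exist in Python 3 at all, and a pattern whose backslashes really do survive (a plain r"..." literal, or a pattern read from a file) behaves differently in the two versions. In Python 3 (3.3 and later), that's fine, because the re module understands \u escapes as meaning Unicode characters, so the pattern ends up being the character class with U+2122 and U+00AE in it, exactly as you want. But in Python 2, it doesn't, so the character class ends up being a mess of useless junk.

If you use a non-raw string literal, you don't depend on any of that: u"[\u2122\u00ae]". Of course that will bring up all the other potential problems that make people want to use raw string literals in the first place with regular expressions, but fortunately, you don't have any of them here.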
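A minimal sketch of the non-raw form, written so that it runs unchanged on Python 2.7 and Python 3 (the variable names are only illustrative). The \u escapes are resolved by the string literal itself, so the re module receives a four-character pattern and never has to interpret a backslash:

```python
import re

# Non-raw Unicode literal: the parser turns each \uXXXX into one character.
pattern_text = u"[\u2122\u00ae]"
print(len(pattern_text))  # 4: '[', U+2122, U+00AE, ']'

# Build the subject with escapes too, so nothing depends on console encoding.
subject = u"a\u2122b\u00aec"

# Use a Unicode replacement (u"") so no str/unicode coercion is needed.
cleaned = re.sub(pattern_text, u"", subject)
print(repr(cleaned))
```

Passing u"" rather than "" as the replacement also avoids the implicit str/unicode mixing mentioned in the comments on the question.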

The second problem is that you're using Unicode characters in Unicode literals without an encoding declaration. Again, not a problem in Python 3, but it is in Python 2.

When you type "a™b®c", there's a good chance that you're actually giving Python not a \u2122 character, but a \u0099 character. Your console is probably in something like cp1252, so when you type or paste a ™, what it actually gives Python is U+0099, not U+2122. Of course your console also displays things incorrectly, so that U+0099 ends up looking like a ™. But Python doesn't have any idea what's going on. It just sees that U+0099 is not the same character as U+2122, and therefore there's no match. (Your first example works because your search string also has the incorrect \u0099, so it happens to match.)
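The mismatch can be modelled without a Windows console. The sketch below is only an illustration under an assumption: it takes the bytes that cp1252 uses for a™b®c and decodes them as Latin-1, which is one way to end up with U+0099. What a particular console and Python 2 build actually do may differ.

```python
import re

# Bytes that cp1252 uses for the text a(TM)b(R)c: 0x99 is the trademark sign.
typed_bytes = b"a\x99b\xaec"

# Assumption for illustration: the bytes are decoded as Latin-1, not cp1252.
misread = typed_bytes.decode("latin-1")
intended = typed_bytes.decode("cp1252")

# Show the code points each decoding produces.
print([hex(ord(ch)) for ch in misread])
print([hex(ord(ch)) for ch in intended])

marks = re.compile(u"[\u2122\u00ae]")

# U+0099 is not in the character class, so it survives the substitution.
print(marks.sub(u"", misread) == u"a\x99bc")
# Decoded correctly, both marks are removed.
print(marks.sub(u"", intended) == u"abc")
```

In this model the ® still matches, because byte 0xAE maps to U+00AE in both encodings; only the ™ is lost.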

In source code, you could fix this, either by adding an encoding declaration to tell Python that you're using cp1252, or by telling your editor to use UTF-8 instead of cp1252 in the first place. But at the interactive interpreter, you get whatever encoding your console wants, and there's nowhere to put an encoding declaration.
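For the source-file case, a sketch of what the declaration looks like. It assumes the file really is saved as UTF-8 by the editor; the name TRADEMARK_CHARS is just for illustration. The declaration has to be on the first or second line of the file for Python 2 to honor it.

```python
# -*- coding: utf-8 -*-
import re

# With the declaration above, Python 2 decodes these literals as UTF-8,
# so the pattern really contains U+2122 and U+00AE.
TRADEMARK_CHARS = re.compile(u"[™®]")

cleaned = TRADEMARK_CHARS.sub(u"", u"a™b®c")
print(repr(cleaned))
```

Python 3 treats source files as UTF-8 by default, so there the declaration is harmless but unnecessary.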

Really, there's no good solution to this.

Well, there is: upgrade to Python 3. The main reason it exists in the first place is to make Unicode headaches like this go away, and Python 2 is less than a year and a half from end of life, so do you really want to learn how to deal with Unicode headaches in Python 2 today?

You could also get a UTF-8 terminal (and one that Python recognizes as such). That's automatic on macOS or most recent Linux distros; on Windows, it's a lot harder, and probably not the way you want to go here.

So, the only alternative is to just never use Unicode characters in Unicode literals on the interactive interpreter. Again, you can use them in source code, but interactively, you have to either:

Use backslash escapes.
Use non-Unicode literals and carefully decode them everywhere.

I'm not sure whether "a™b®c".decode('cp1252') is really better than \u escapes, but it will work.
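Both options can be sketched in a form that runs on Python 2.7 and Python 3. The helper to_text is hypothetical, and cp1252 is an assumption about the console; the bytes literal stands in for what a plain "a™b®c" literal would hold at a cp1252 Python 2 prompt.

```python
import re

MARKS = re.compile(u"[\u2122\u00ae]")

# Option 1: backslash escapes only, so no non-ASCII character is ever typed.
from_escapes = MARKS.sub(u"", u"a\u2122b\u00aec")


def to_text(value, encoding="cp1252"):
    """Hypothetical helper: decode byte strings, pass Unicode text through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Option 2: take the byte string as typed and decode it explicitly.
typed = b"a\x99b\xaec"  # stand-in for "a(TM)b(R)c" entered in a cp1252 console
from_decode = MARKS.sub(u"", to_text(typed))

print(repr(from_escapes))
print(repr(from_decode))
```

Decoding in one place, as close to the input as possible, keeps everything downstream working on Unicode values only.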

According to your suggestion, I used them in source code and that works fine by using unicode. Meanwhile, I update the regex to `u"[\u0099\u2122]"` to see if Python could identify the `™` mark if the console did some mis-interpreting, and the result shows Python still can't match, maybe it's because my console is by default ANSI? All in all, I think you are right, using Python 2 is the problem itself. — calvin, Jul 31 '18 at 05:37

score 0 · Answer 2 · answered Jul 29 '18 at 03:05

Just remove the r before the string and it works:

re.sub(u"[\u2122\u00ae]", "", u"a™b®c")

answered Jul 29 '18 at 03:05

John Zwinck


I copied your code and it's still not working on `Python 2.7.15 |Anaconda, Inc.| (default, May 1 2018, 18:37:09) [MSC v.1500 64 bit (AMD64)] on win32`. It prints `u'a?b?c'` – calvin Jul 29 '18 at 03:07
@calvin Both your attempt in the question and this answer work on my machine. I'm using Python 2.7.10 and macOS High Sierra. Maybe it has something to do with the encoding on Windows? – Sweeper Jul 29 '18 at 03:14
This will work on Mac, but only because the Mac terminal is UTF-8. It won't work on Windows. – abarnert Jul 29 '18 at 03:31
