
I've just started learning regular expressions and the documentation for re.sub() states:

Changed in version 3.5: Unmatched groups are replaced with an empty string.

Deprecated since version 3.5, will be removed in version 3.6: Unknown escapes consist of '\' and ASCII letter now raise a deprecation warning and will be forbidden in Python 3.6.

Is re.sub() deprecated? What should I use then?

2 Answers


You misunderstand the documentation. The re.sub() function is not deprecated. The deprecation warning concerns specific syntax.

Earlier in the re.sub() documentation you'll find this:

Unknown escapes such as \& are left alone.

If you use an unknown escape consisting of a backslash and an ASCII letter, the escape will no longer be silently ignored; you'll get a warning instead. This applies both to re.sub() replacement patterns and to the regular expression patterns themselves. The same warning appears in the section on regex pattern syntax.
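As a rough illustration of which escapes the regex engine accepts, here is a minimal sketch. It assumes Python 3.6 or later, where the warning has become an error for patterns; on 3.5 the last case would only emit a DeprecationWarning. The helper name `compiles` is made up for this example.

```python
import re

def compiles(pattern):
    """Return True if `pattern` compiles, False if the regex engine rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r'\d'))  # known escape: the digit class
print(compiles(r'\&'))  # backslash + non-letter: treated as a literal &
print(compiles(r'\j'))  # backslash + ASCII letter with no defined meaning
```

Only the last pattern is affected by the deprecation; escapes of non-letters such as `\&` are still accepted.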

The Changed in version 3.5 line also concerns how re.sub() works. Rather than raise an exception when the group that a \number backreference refers to did not take part in the match, an empty string is inserted at that location.
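A small sketch of that behaviour, assuming Python 3.5 or later (the pattern and input are invented for illustration): each match fills only one of the two groups, and the other backreference expands to an empty string.

```python
import re

# Each match sets either group 1 or group 2, never both.
result = re.sub(r'(a)|(b)', r'[\1|\2]', 'ab')
print(result)  # [a|][|b]
```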

Martijn Pieters
  • A `\digit` is not an *escape sequence*, it is a backreference consisting of a literal ``\`` + a *number* (there are groups with IDs more than `9`). – Wiktor Stribiżew Aug 10 '16 at 20:20
  • @Wiktor: right, but it is still an escape sequence, one that *signifies* a backreference. Just like `\n` is an escape sequence that signifies a newline. – Martijn Pieters Aug 10 '16 at 21:12
  • The `"\n"` is an escape sequence, but `r"\n"` is not an escape sequence, it is just a combination of ``\`` and `n` - I doubt any regex reference refers to them (and `\w`, `\d`, `\S`, etc.) as *escape sequences*. – Wiktor Stribiżew Aug 10 '16 at 21:15
  • @WiktorStribiżew the regex engine still interprets the `\n` combination as special. You match a literal newline character with them. – Martijn Pieters Aug 10 '16 at 21:18
  • Right, I just would not call it an escape sequence. Maybe I am too used to Microsoft terminology, where *escape sequence* refers to the string literals `\n`, `\r`, `\f`, `\a`, etc. – Wiktor Stribiżew Aug 10 '16 at 21:25
  • @WiktorStribiżew it's a generic term that refers to the technique of flagging a change in meaning from the norm. They are not limited to string literals; switching terminal display modes and colours also use escape sequences, as do modem commands, and many more applications. See Wikipedia: https://en.wikipedia.org/wiki/Escape_sequence – Martijn Pieters Aug 10 '16 at 22:22

The two entries are not related, and re.sub will not be deprecated.

In Python versions earlier than 3.5, re.sub failed if the replacement used a backreference to a capturing group that did not participate in the match. See the SO question "Empty string instead of unmatched group error".

An example where the failure occurred:

import re
old = 'regexregex'
new = re.sub(r'regex(group)?regex', r'something\1something', old)
print(new) # => before Python 3.5 the re.sub call above fails, as there is no "group" between "regex" and "regex" in "regexregex"
#    and Group 1 was not initialized with an empty string, i.e. it remains unset (None)
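If the same code has to run on versions before 3.5 as well, one possible workaround (a sketch, not something the documentation prescribes) is to pass a function as the replacement and substitute an empty string yourself. The function name `fill` is made up for this example.

```python
import re

def fill(match):
    # group(1) is None when the optional group did not participate in the match
    return 'something' + (match.group(1) or '') + 'something'

print(re.sub(r'regex(group)?regex', fill, 'regexregex'))       # somethingsomething
print(re.sub(r'regex(group)?regex', fill, 'regexgroupregex'))  # somethinggroupsomething
```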

As for the second entry, it only says that you will get a warning (and, later, an error) if you use an escape that the regex engine does not know: a literal backslash followed by an ASCII letter. Previously the backslash in such escapes was simply ignored: in Python 2.x through 3.5, print(re.sub(r'\j', '', 'joy')) prints oy. Such escapes will be forbidden in Python 3.6.
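To stay clear of the warning, spell out what you actually mean. A minimal sketch with made-up input strings: drop the backslash if you want the plain letter, or double it if you want a literal backslash.

```python
import re

# To remove the letter j, no backslash is needed at all.
without_j = re.sub(r'j', '', 'joy')

# To match a literal backslash followed by j, escape the backslash itself.
without_backslash_j = re.sub(r'\\j', '', r'a\jb')

print(without_j)            # oy
print(without_backslash_j)  # ab
```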

Wiktor Stribiżew
  • Thank you for the explanation. Such a dumb question. –  Aug 10 '16 at 19:19
  • Well, this `re.sub` behavior has been a nuisance for a long time, so the change that came with Python 3.5 is really a great one. As for the unknown escape sequences, almost every (good) Python regex answer on SO states that it is best practice to use *raw string literals* to define regex patterns. Now these warnings will back that up. – Wiktor Stribiżew Aug 10 '16 at 19:20
  • Rather than describe the `re.sub` issue here, I hope the link to one of my answers will provide extensive reference on that issue that will remain in Python 2.x. – Wiktor Stribiżew Aug 10 '16 at 19:22
  • Python's too-forgiving string literals have always bugged me, but that's not what the warning is about. It's referring to unknown *regex* escapes, like `\j`. Python has always ignored the backslash, which is arguably wrong and definitely inconsistent with other flavors. – Alan Moore Aug 10 '16 at 19:29
  • @AlanMoore: I see, I fixed that and updated with more illustrations. – Wiktor Stribiżew Aug 10 '16 at 20:05