
A component in Python's docutils module uses the regular expression below in the machinery that is designed to translate text flanked with asterisks into italicised text:

Raw: Most people know what is meant by the Latin phrase *Carpe Diem*.

Translated: Most people know what is meant by the Latin phrase Carpe Diem. (with "Carpe Diem" rendered in italics)


It's a pretty straightforward pattern: match an asterisk if it is not preceded by a space, a newline or the null character. What I'd like to know is what's gained by appending the empty Unicode string (u'') to the pattern. It's appended to a number of other patterns in docutils as well, but I've no idea what difference it makes to whether a given bit of text matches or not.

import re

non_whitespace_escape_before = r'(?<![ \n\x00])'
end_string_suffix = u''

emphasis = re.compile(non_whitespace_escape_before + r'(\*)' + end_string_suffix, re.U)
# emphasis.pattern -> u'(?<![ \\n\\x00])(\\*)'
Martijn Pieters
Paul Patterson
  • http://stackoverflow.com/questions/2081640/what-exactly-do-u-and-r-string-flags-do-in-python-and-what-are-raw-string-l – matt Apr 24 '17 at 11:12
  • It indicates the string's encoding, [string literals](https://docs.python.org/2.7/reference/lexical_analysis.html#strings). It can be a raw string or a unicode string. I don't think it is special for regex. – matt Apr 24 '17 at 11:17
  • @BryanOakley the result is a string concatenation of r"..." + u"", which gives you a unicode string in the end. – matt Apr 24 '17 at 11:26
  • @matt are you suggesting that appending u'' is just a way of converting the string to a unicode string? – Paul Patterson Apr 24 '17 at 11:30
  • I am just stating that is a consequence, I am not sure of the relevance. Martijn is going through the revisions. – matt Apr 24 '17 at 11:33

1 Answer


You missed that the string is not always empty; from the relevant source code:

if getattr(settings, 'character_level_inline_markup', False):
    start_string_prefix = u'(^|(?<!\x00))'
    end_string_suffix = u''
else:
    start_string_prefix = (u'(^|(?<=\\s|[%s%s]))' %
                           (punctuation_chars.openers,
                            punctuation_chars.delimiters))
    end_string_suffix = (u'($|(?=\\s|[\x00%s%s%s]))' %
                         (punctuation_chars.closing_delimiters,
                          punctuation_chars.delimiters,
                          punctuation_chars.closers))

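The gain is that the variable is defined in both branches, not that it is empty. It indeed makes no difference when it is empty, but that is only the case when the character_level_inline_markup setting is enabled. Otherwise (the else branch), the compiled patterns get a suffix that changes behaviour compared to the empty string: the end-string must be followed by the end of the text, whitespace, a NUL character or one of the listed punctuation characters.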
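To see the difference in practice, here is a minimal sketch in Python 3 syntax. The closing_chars class is a hypothetical, much smaller stand-in for the character classes docutils builds from punctuation_chars, and the sample text is made up; only the lookbehind and the overall shape of the suffix come from the code above.

```python
import re

# Lookbehind from the question: the asterisk must not follow a space,
# a newline or a NUL character.
non_whitespace_escape_before = r'(?<![ \n\x00])'

# Hypothetical, heavily simplified stand-in for the punctuation classes
# that docutils takes from punctuation_chars.
closing_chars = r'.,;:!?)\]'

empty_suffix = ''
# Same shape as the else branch: end of string, or a lookahead for
# whitespace, NUL or closing punctuation.
lookahead_suffix = r'($|(?=\s|[\x00%s]))' % closing_chars

emphasis_loose = re.compile(non_whitespace_escape_before + r'(\*)' + empty_suffix)
emphasis_strict = re.compile(non_whitespace_escape_before + r'(\*)' + lookahead_suffix)

text = 'phrase *Carpe Diem*. And 2*3 too.'

# Positions of asterisks accepted as an emphasis end-string.
loose_positions = [m.start() for m in emphasis_loose.finditer(text)]
strict_positions = [m.start() for m in emphasis_strict.finditer(text)]

print(loose_positions)   # [18, 26]
print(strict_positions)  # [18]
```

With the empty suffix, the asterisk in 2*3 is also accepted as an end-string; with the lookahead suffix, only the asterisk followed by a full stop is. Appending the empty string leaves the pattern itself unchanged.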

The docutils project is otherwise a little sloppy in mixing bytestrings and Unicode strings in Python 2; they get away with this because all the bytestrings being concatenated to Unicode strings happen to be ASCII-clean and can thus be decoded implicitly.
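As a rough illustration of that risk, sketched in Python 3 where the decode has to be written out explicitly: Python 2 performed the equivalent of .decode('ascii') behind the scenes whenever a bytestring was concatenated to a Unicode string. The byte values below are made-up examples, not taken from docutils.

```python
# Python 2 decoded bytestrings implicitly (with the ASCII codec) when they
# were concatenated to unicode objects; Python 3 makes that step explicit.
ascii_only = b'(\\*)'
pattern_text = ascii_only.decode('ascii') + u''  # fine: pure ASCII

# A left single quotation mark encoded as UTF-8 bytes (non-ASCII data).
non_ascii = u'\u2018'.encode('utf-8')
try:
    non_ascii.decode('ascii')
    decode_failed = False
except UnicodeDecodeError:
    # The same error Python 2 raised for bytestring + unicode whenever
    # the bytestring held non-ASCII data.
    decode_failed = True

print(pattern_text, decode_failed)  # (\*) True
```

So the concatenation is harmless only as long as every bytestring involved stays within ASCII.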

Martijn Pieters
  • @matt: note that I quoted the code on trunk in SVN here; it uses Unicode throughout; the OP seems to be quoting an older revision where the transition was still incomplete. `r` is not a regex string, it's a *raw string literal* and in Python 2 that'll produce a bytestring, and concatenating a unicode string triggers an implicit decode; that carries risks if there is non-ASCII data in any of these strings. So, *here* it doesn't matter, but it *could* potentially matter. – Martijn Pieters Apr 24 '17 at 11:26
  • Yes, I understand r is a raw string. But if they're using the regex for unicode strings, then appending the `u""` changes the regex to a unicode string instead of an 8bit string. – matt Apr 24 '17 at 11:28
  • Also, writing a raw string might make escaping easier, e.g. `x = r"\n" + u""` instead of `x = u"\\n"` – matt Apr 24 '17 at 11:31
  • @matt: I'm trolling through the revision history; I suspect there is a Python 2 / 3 bridge thing going on too (there is no `ur'...'` raw string literal, because the semantics of `\uhhhh` escapes changed between 2 and 3). – Martijn Pieters Apr 24 '17 at 11:32
  • @matt: the pattern internally in Python 2 is either bytes or Unicode, so you want to pass in a `unicode` object when you want to be able to match Unicode codepoints (and not, say, the UTF-8 bytes); matches otherwise just fail (as the Unicode codepoint and encoded bytes won't produce a match). – Martijn Pieters Apr 24 '17 at 11:36
  • @matt: the codebase is a bit of a mess here; you should avoid concatenating Unicode and bytestring objects, but docutils gets away with it because their bytestrings are all ASCII only (and anything that contains non-ASCII is already defined as a Unicode object). – Martijn Pieters Apr 24 '17 at 11:38
  • (and I was wrong about my 'earlier revision' assertion earlier on; I was looking at the wrong variable). – Martijn Pieters Apr 24 '17 at 11:40
  • I'm annoyed that I didn't notice that. Thanks for taking the trouble to look through the module itself. – Paul Patterson Apr 24 '17 at 11:41