
Using Python with Matthew Barnett's regex module.

I have this string:

The well known *H*rry P*tter*.

I'm using this regex to process the asterisks to obtain <em>H*rry P*tter</em>:

import regex as re  # Matthew Barnett's regex module, needed for \p{L} and \p{N}

REG = re.compile(r"""
(?<!\p{L}|\p{N}|\\)
\*
([^\*]*?) # I need this part to deal with nested patterns; I really can't omit it
\*
(?!\p{L}|\p{N})
""", re.VERBOSE)

PROBLEM

The problem is that this regex doesn't match strings like this one unless I first protect the intraword asterisks (I convert them to decimal entities), which is awfully expensive in documents with lots of asterisks.
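
To make the failure concrete, here is a minimal sketch of the pattern above. As an assumption, it swaps `\p{L}|\p{N}` for `[^\W_]` (a close approximation) so that it runs on the standard-library `re` module rather than the `regex` module; the names `ORIGINAL`, `text`, and `simple` are illustrative only.

```python
import re

# Assumption: [^\W_] stands in for \p{L}|\p{N} so this runs on stdlib re.
ORIGINAL = re.compile(r"""
(?<![^\W_]|\\)   # not preceded by a letter, a digit, or a backslash
\*
([^*]*?)         # emphasized text: may not contain any asterisk
\*
(?![^\W_])       # not followed by a letter or a digit
""", re.VERBOSE)

text = "The well known *H*rry P*tter*."
simple = "The well known *Harry Potter*."

# Without intraword asterisks the pattern works as intended.
print(ORIGINAL.sub(r"<em>\1</em>", simple))
# The well known <em>Harry Potter</em>.

# With intraword asterisks, [^*]*? stops at the asterisk after "H",
# the lookahead then sees a letter, and every other asterisk is
# preceded by a letter, so nothing matches at all.
print(ORIGINAL.search(text))
# None
```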

QUESTION

Is it possible to tell the negative class to block at internal asterisks only if they are not surrounded by word characters?

I tried these patterns in vain:

  • ([^(?:[^\p{L}|\p{N}]\*[^\p{L}|\p{N}])]*?)
  • ([^(?<!\p{L}\p{N})\*(?!\p{L}\p{N})]*?)
  • Can you please share the replacing code itself? Also, maybe you want `re.sub(r'\B\*\b([^*]*(?:\b\*\b[^*]*)*)\b\*\B', r'\1', s)`? (If it is Python 2.x, add `u` prefix to enforce a `re.UNICODE` flag). – Wiktor Stribiżew Oct 06 '16 at 20:18
  • Do you mean [**`\B(\*)([^*]*(?:\*\b[^*]*)*)(\*)\B`**](https://regex101.com/r/HKmWDg/1)? – revo Oct 06 '16 at 20:23
  • @WiktorStribiżew it's just a re.sub like yours, repeated twice to match one level of nesting. I'll try your suggestion now –  Oct 06 '16 at 20:24
  • Please what do you mean by *nested patterns* here? – revo Oct 06 '16 at 20:28
  • @WiktorStribiżew it's perfect! Thank you very much. If you post it as an answer I'll accept it –  Oct 06 '16 at 20:29
  • @revo your regex doesn't work (I get this: `Hey *!` with a string like `'Hey *Har*ry P*tter*!'`). Nested patterns are something like `*Harry *wizard* Potter*`, but they are not a problem, I already solved it by repeating the sub twice –  Oct 06 '16 at 20:30
  • @revo your regex works, I didn't notice it required capture group 2 instead of 1 –  Oct 06 '16 at 20:35
  • Well if you found your answer there is no more discussion left. My proposed RegEx works if replacement string is `\2`. The one I suggested is the same as @WiktorStribiżew , but since he is a *fast editor guy* I didn't see his edit before posting my comment. – revo Oct 06 '16 at 20:37
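
The "repeat the sub twice" approach for one level of nesting, mentioned in the comments above, could look roughly like the sketch below. The helper name `emphasize` and its `passes` parameter are hypothetical, and the `<em>\1</em>` replacement is assumed from the output the question asks for.

```python
import re

# Pattern suggested in the comments above (standard-library re is enough here).
EMPHASIS = re.compile(r'\B\*\b([^*]*(?:\b\*\b[^*]*)*)\b\*\B')

def emphasize(text, passes=2):
    """Apply the substitution several times; each pass resolves one nesting level."""
    for _ in range(passes):
        text = EMPHASIS.sub(r'<em>\1</em>', text)
    return text

# Pass 1 wraps the inner *wizard*; pass 2 wraps the outer pair.
print(emphasize('*Harry *wizard* Potter*'))
# <em>Harry <em>wizard</em> Potter</em>
```

A second pass leaves already-converted text alone, because the remaining intraword asterisks are preceded by a word character and so fail the leading `\B`.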

1 Answer


I suggest a single regex replacement for cases like the ones you mentioned above:

re.sub(r'\B\*\b([^*]*(?:\b\*\b[^*]*)*)\b\*\B', r'<em>\1</em>', s)

See the regex demo

Details:

  • \B\*\b - a * that is preceded with a non-word boundary and followed with a word boundary
  • ([^*]*(?:\b\*\b[^*]*)*) - Group 1 capturing:
    • [^*]* - 0+ chars other than *
    • (?:\b\*\b[^*]*)* - zero or more sequences of:
      • \b\*\b - a * enclosed with word boundaries
      • [^*]* - 0+ chars other than *
  • \b\*\B - a * that is followed with a non-word boundary and preceded with a word boundary
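
The breakdown above can be checked with a short script. This is only a sketch: it rewrites the same pattern in `re.VERBOSE` form so each part carries its explanation, and the sample strings are the ones from the question and comments plus one made-up arithmetic string (`2 * 3 * 4`) to show what is left alone.

```python
import re

EMPHASIS = re.compile(r"""
    \B\*\b                # opening *: non-word boundary before, word boundary after
    (                     # group 1: the emphasized text
        [^*]*             #   0+ chars other than *
        (?:\b\*\b[^*]*)*  #   intraword * plus 0+ non-* chars, repeated
    )
    \b\*\B                # closing *: word boundary before, non-word boundary after
""", re.VERBOSE)

samples = [
    "The well known *H*rry P*tter*.",
    "Hey *Har*ry P*tter*!",
    "2 * 3 * 4",  # free-standing asterisks: no word boundary next to them
]

for s in samples:
    print(EMPHASIS.sub(r"<em>\1</em>", s))
# The well known <em>H*rry P*tter</em>.
# Hey <em>Har*ry P*tter</em>!
# 2 * 3 * 4
```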

More information on word boundaries and non-word boundaries:

Wiktor Stribiżew