I'm sure someone has already asked this question, but I don't know what words to search for in Google to find the answers.

I have to "translate" a text with markup to html (or rtf or xaml). The markup for "bold" is *. If I'd like the bold text to contain a literal *, I have to mask it with a backslash.

So, the marked-up text...

This is *ju\*st* a test.

...should translate to "This is ju*st a test."

I'm looking for a regex pattern to get all the matches to "translate" to bold inside my marked-up text.

Right now I'm stuck with this one (a literal star followed by one or more characters that are not a star (as few as possible), followed by a literal star)

\*[^*]+?\*

But how can I enhance the "one or more characters that are not a star" part so that it doesn't stop at stars that are preceded by a backslash?

I want to use this regex in a .NET project, in case there are differences between the languages.
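
The problem can be reproduced with a minimal sketch. It is written in Python rather than .NET purely for illustration; this simple pattern uses no engine-specific syntax, and the variable names are made up for the example.

```python
import re

text = r"This is *ju\*st* a test."

# The pattern from the question: a star, one or more non-stars (lazy), a star.
naive = re.compile(r"\*[^*]+?\*")

matches = naive.findall(text)
print(matches)  # ['*ju\\*'], i.e. the match ends at the escaped star
```

The match closes on the star of `\*` instead of on the real closing markup star.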

Nostromo
  • `\*(\\\*|[^*])+?\*` -- either a backslash-star or a character which is not a star? – AlexP Feb 01 '19 at 21:57
  • You need `(?<=(?<!\\)(?:\\{2})*)\*[^\\*]*(?:\\.[^\\*]*)*\*`. See [.NET regex demo](http://regexstorm.net/tester?p=%28%3f%3c%3d%28%3f%3c!%5c%5c%29%28%3f%3a%5c%5c%7b2%7d%29*%29%5c*%5b%5e%5c%5c*%5d*%28%3f%3a%5c%5c.%5b%5e%5c%5c*%5d*%29*%5c*&i=This+is+*ju%5c*st*+a+test.) Do not use regex101 to test .NET regex patterns, it does not support .NET regex syntax. – Wiktor Stribiżew Feb 01 '19 at 22:01
  • You can't just use `\*(\\\*|[^*])+?\*` because this pattern [does not make sure](http://regexstorm.net/tester?p=%5c*%28%5c%5c%5c*%7c%5b%5e*%5d%29%2b%3f%5c*&i=This+is+%5c*+*ju%5c*st*+a+test.) the first `*` matched is not an escaped asterisk. – Wiktor Stribiżew Feb 01 '19 at 22:06
  • @Wiktor: Can you please explain your long regex pattern a little bit for a newbie? – Nostromo Feb 01 '19 at 22:12

2 Answers


You want to match from one markup star to the next markup star. In your markup language, a literal star is written not as a bare * but as \*. In regex, this translates to \\\*: a backslash, which must be escaped, then a star, which must be escaped too.

Therefore, you need to specify in your pattern that you're looking for a markup star, as opposed to a literal star.

\*.*[^\\]\*

\*             a markup star
  .*           followed by any number of characters
    [^\\]\*    then a markup star, that is, one not escaped by a backslash

This is a little off though, because .* is greedy: in "*ju\*st* *ju\*st*", it matches the whole string, from the first star to the last.

You can use the lazy/non-greedy version of the star modifier: *? in most engines. So it becomes:

\*.*?[^\\]\*

\*             a markup star
  .*?          followed by any characters, but as few as possible
     [^\\]\*   then a markup star, that is, one not escaped by a backslash

Small try with Python:

>>> import re
>>> s = r"*ju\*st* *ju\*st*"
>>> re.match(r"\*.*[^\\]\*", s)
<re.Match object; span=(0, 17), match='*ju\\*st* *ju\\*st*'>
>>> re.match(r"\*.*?[^\\]\*", s)
<re.Match object; span=(0, 8), match='*ju\\*st*'>

If your regex engine does not support lazy modifiers, you'll need to make this behaviour explicit:

\*([^*]|\\\*)*[^\\]\*

\*                       a markup star
  (                      then either...
   [^*]                  ...any character but a star...
       |                 ...or...
        \\\*             ...a star prefixed by a backslash, i.e. a literal star
            )*           any number of times
              [^\\]\*    then a markup star
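
As a quick check, here is a small Python sketch of that explicit pattern on the same test string as above. The variable names are only illustrative, and Python is used because it is the language of the earlier try; the target engine may differ in details.

```python
import re

s = r"*ju\*st* *ju\*st*"

# Explicit version: no lazy quantifier, escaped stars allowed inside.
explicit = re.compile(r"\*([^*]|\\\*)*[^\\]\*")

first = explicit.match(s)
print(first.span())  # (0, 8)

# finditer yields each bold span separately instead of one long match.
all_matches = [m.group(0) for m in explicit.finditer(s)]
print(all_matches)   # ['*ju\\*st*', '*ju\\*st*']
```

Even though the group is greedy, it cannot run past the closing markup star, because neither alternative can consume a bare star.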
Right leg

You may use

(?<=(?<!\\)(?:\\{2})*)\*[^\\*]*(?:\\.[^\\*]*)*\*

See the .NET regex demo.

Details

  • (?<=(?<!\\)(?:\\{2})*) - a positive lookbehind that makes sure there is no \ escape char right before the current location. In other words, it matches a location that is immediately preceded with:
    • (?<!\\) - no \ char followed with
    • (?:\\{2})* - any zero or more repetitions of double backslashes
  • \* - a * char
  • [^\\*]* - zero or more chars other than \ and *
  • (?: - start of a non-capturing group matching...
    • \\. - any char (other than a newline, compile the pattern with RegexOptions.Singleline to allow any escaped char) escaped with a \ char
    • [^\\*]* - zero or more chars other than \ and *
  • )* - zero or more times
  • \* - a * char.
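
To see the pieces work together, here is a hedged Python sketch. Python's `re` module does not accept the variable-length lookbehind used above, so this sketch swaps it for a different technique: an alternation that consumes any escaped character outside bold text first, so an escaped star can never open a match. The names `BOLD`, `unescape` and `to_html`, and the choice of `<b>` tags, are assumptions made for the illustration; no HTML escaping is done.

```python
import re

# Alternative 1: an escaped char outside bold (stands in for the lookbehind).
# Alternative 2: the unrolled body between two unescaped stars.
BOLD = re.compile(r"\\(.)|\*([^\\*]*(?:\\.[^\\*]*)*)\*", re.DOTALL)

def unescape(text):
    """Drop the backslash in front of every escaped character."""
    return re.sub(r"\\(.)", r"\1", text, flags=re.DOTALL)

def to_html(text):
    """Hypothetical helper: turn *bold* markup into <b> tags."""
    def repl(m):
        if m.group(1) is not None:      # escaped char outside bold text
            return m.group(1)
        return "<b>" + unescape(m.group(2)) + "</b>"
    return BOLD.sub(repl, text)

print(to_html(r"This is *ju\*st* a test."))     # This is <b>ju*st</b> a test.
print(to_html(r"This is \* *ju\*st* a test."))  # This is * <b>ju*st</b> a test.
```

In .NET the lookbehind version can be used directly; the alternation trick is only needed in engines without variable-length lookbehind.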
Wiktor Stribiżew
  • Thank you. This rather complicated Regex works in all cases I tested it with, but I don't really know why. I think the `[^\\*]*` from the fourth primary bullet point is capturing all the text between the two markup * chars. But with this Regex part, shouldn't the capturing stop right in front of any backslash or star, whether they succeed one another or not, whether they come in the right order or not? – Nostromo Feb 08 '19 at 14:04
  • @Nostromo You need to see the whole `\*[^\\*]*(?:\\.[^\\*]*)*\*` part. It matches from `*` till the first met `*` matching any escaped chars in between. `\*` matches `*`, `[^\\*]*` matches any chars but ``\`` and `*`, but then `(?:\\.[^\\*]*)*` matches *0 or more repetitions* of any escaped char + any chars other than `*` and ``\``, thus, making it possible to match all escaped chars after an unescaped `*`. Then, `\*` matches a `*`. It is an [unrolled](https://stackoverflow.com/questions/38018210) `\*([^\\*]|\\.)*\*` pattern that might be a bit simpler to analyze, but is less efficient. – Wiktor Stribiżew Feb 08 '19 at 14:08
  • Right now I don't care much about efficiency; I'd rather understand a Regex, even when looking at the code in one year. So, for now I'm sticking to the "rolling" Regex you suggested. Thank you. – Nostromo Feb 08 '19 at 14:24
  • @Nostromo If you use `\*([^\\*]|\\.)*\*` in Java, you are sure to get a stack overflow exception. Use the unrolled version. Even the non-capturing group - `\*(?:[^\\*]|\\.)*\*` - won't help one day. – Wiktor Stribiżew Feb 08 '19 at 14:26
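
The relationship between the two forms discussed in these comments can be checked with a small Python sketch. It only compares what the patterns match on a few made-up samples; it says nothing about performance or stack depth in any particular engine, and the lookbehind is left out because Python's `re` does not support it.

```python
import re

# The simpler "rolled" form and its unrolled equivalent from the comments.
rolled = re.compile(r"\*(?:[^\\*]|\\.)*\*")
unrolled = re.compile(r"\*[^\\*]*(?:\\.[^\\*]*)*\*")

samples = [
    r"This is *ju\*st* a test.",
    r"*a* and *b\\*",   # bold text ending in an escaped backslash
    r"no markup here",
]

# Pair up the matches of both patterns for each sample.
results = [(rolled.findall(s), unrolled.findall(s)) for s in samples]
for s, (a, b) in zip(samples, results):
    print(s, a == b)
```

On these samples both patterns return the same matches, which is what unrolling is meant to preserve.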