Regular Expression Without Lookbehind for Markdown Bolding

Question

So I am trying to write a regular expression for JavaScript that will allow me to replace ** with <strong> tags as a sort of self-rolled Markdown to HTML converter.

e.g.

**bold** -> <strong>bold</strong>

but

\**not** -> **not** because * was escaped.

I have the following regular expression which seems to work well:

/(?<!\\)(?:\\\\)*(\*\*)([^\\\*]+)(\*\*)/g

However, JS does not support lookbehinds! I rewrote it using lookaheads:

/(\*\*)([^\\\*]+)*(\*\*)(?!\\)(?:\\\\)*/g

but this would require me to reverse the string which is undesirable because I need to support multibyte characters (see here). I am not completely opposed to using the library mentioned in that answer, but I would prefer a solution that does not require me to add one if possible.

Is there a way to rewrite my regular expression without using look behinds?

EDIT:

After thinking about this a little more, I'm starting to question whether regular expressions are even the best way to approach this problem, but I will leave the question up out of interest.

What should the result be given the input `**foo * bar **` or `**foo \** bar**`? — Jordan Running, May 25 '17 at 14:42
I would expect `foo * bar` and `foo \** bar ` respectively. It's very possible my regular expression does not cover ALL cases, as I have not even been able to test it yet. I'm less concerned about missing edge cases, and more concerned about writing the regular expression without a lookbehind, but pointing out missed cases is still helpful, so thank you! I'm also not very concerned with edge cases because this is being used in an administrative tool where strange cases like that do not really come up. — thatidiotguy, May 25 '17 at 14:45
Do you really expect malformed strings? You know, even a correct parser will yield incorrect results if your string is malformed. Try https://regex101.com/r/J8imcO/1 — Wiktor Stribiżew, May 25 '17 at 15:36
If you have `\**not** **` the `"** **"` will get highlighted anyway. — Wiktor Stribiżew, May 25 '17 at 15:46

Dmitry Egorov · Accepted Answer · 2017-05-25T15:59:20.570

One way to work around missing lookbehinds is to match the undesired patterns first and then, using alternation, match the desired pattern. Then apply a conditional replace, substituting the undesired patterns with themselves and the desired ones with what you actually want.

In your particular case this means matching \* first and **<something>** only after that. Then use

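// The g flag makes replace() process every match, not just the first one.
input.replace(/\\\*|\*\*(.*?)\*\*/g, function(m, p1) {
    // An escaped asterisk is put back unchanged; a bold span is wrapped.
    return m === '\\*' ? m : '<strong>' + p1 + '</strong>';
})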

to do the conditional replace.
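As a quick check of the idea, here is a minimal DOM-free sketch using the simple regex above with the `g` flag. The function name `boldSimple` is made up for illustration, and the escaping backslash is kept in the output, as in the answer's code.

```javascript
// Hypothetical helper name; the pattern is the simple one from the answer.
function boldSimple(input) {
  // Alternative 1 (\\\*) consumes an escaped asterisk so it cannot start a bold span.
  // Alternative 2 captures the text between a pair of ** markers.
  return input.replace(/\\\*|\*\*(.*?)\*\*/g, function (m, p1) {
    return m === '\\*' ? m : '<strong>' + p1 + '</strong>';
  });
}

console.log(boldSimple('**bold**'));          // <strong>bold</strong>
console.log(boldSimple('\\**not**'));         // \**not**
console.log(boldSimple('a **b** and **c**')); // a <strong>b</strong> and <strong>c</strong>
```

Because the escaped asterisk is consumed first, the `*not**` that remains has no opening `**` followed by a closing pair, so nothing else matches on that line.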

The real regex is more complex, though. First, you need to guard against an escaped backslash itself (i.e. \\**bold** should become \\<strong>bold</strong>). So you need to match \\ separately, the same way as you do for \*.

Second, the expression between ** and ** may also contain escaped asterisks and backslashes. To cope with this you need to match \\ and \** explicitly and, using alternation, only after that anything else non-greedily. This may be represented as (?:\\\\|\\\*\*|\*(?!\*)|[\S\s])*?.

Therefore the final regex becomes

\\\\|\\\*|\*\*((?:\\\\|\\\*\*|\*(?!\*)|[\S\s])*?)\*\*

Demo: https://regex101.com/r/Da35r5/1
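If the final regex is hard to scan, the following sketch assembles the same pattern from named pieces and runs it without a browser. The constant and function names (`ESCAPED_BACKSLASH`, `BOLD_BODY`, `boldFull`, and so on) are made up for illustration; only the pattern itself comes from the answer.

```javascript
// Hypothetical names; each piece is copied from the final regex above.
const ESCAPED_BACKSLASH = String.raw`\\\\`;  // matches the two characters \\
const ESCAPED_ASTERISK = String.raw`\\\*`;   // matches the two characters \*
// Body of a bold span: \\, then \**, then a lone *, then any other character, lazily.
const BOLD_BODY = String.raw`(?:\\\\|\\\*\*|\*(?!\*)|[\S\s])*?`;
const BOLD_SPAN = String.raw`\*\*(` + BOLD_BODY + String.raw`)\*\*`;

// The escapes come first in the alternation, so they win over BOLD_SPAN.
const re = new RegExp([ESCAPED_BACKSLASH, ESCAPED_ASTERISK, BOLD_SPAN].join('|'), 'g');

function boldFull(md) {
  // Escapes are put back unchanged; bold spans are wrapped in <strong>.
  return md.replace(re, (match, p1) =>
    match.startsWith('\\') ? match : '<strong>' + p1 + '</strong>');
}

console.log(boldFull('**bold**'));                  // <strong>bold</strong>
console.log(boldFull(String.raw`\\**bold**`));      // \\<strong>bold</strong>
console.log(boldFull(String.raw`\**not**`));        // \**not**
console.log(boldFull(String.raw`**foo \** bar**`)); // <strong>foo \** bar</strong>
```

`String.raw` is used only so that the backslashes in the pieces and in the sample inputs read exactly as they would in the Markdown text.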

JavaScript replace demo:

function convert() {
  var md = document.getElementById("md").value;
  var re = /\\\\|\\\*|\*\*((?:\\\\|\\\*\*|\*(?!\*)|[\S\s])*?)\*\*/g;
  var html = md.replace(re, function(match, p1) {
    return match.startsWith('\\') ? match : '<strong>' + p1 + '</strong>';
  });
  document.getElementById("html").value = html;
}

<span style="display:inline-block">
MD
<textarea id="md" cols="20" rows="10" style="display:block">
**bold**
**foo * bar **
**foo \** bar**
**fo\\\\** bar** **
\**bold** **
\\**bold**
** multi
line**
</textarea>
</span>

<span style="display:inline-block">
HTML
<textarea id="html" cols="50" rows="10" style="display:block">
</textarea>
</span>

<button onclick="convert()" style="display:block">Convert</button>

Thank you for the detailed explanation of your general strategy for eliminating look behinds. I am doing the parsing of the string line by line after doing a `split("\n")`, so the multiline was not even necessary! — thatidiotguy, May 25 '17 at 15:44

score 0 · Answer 2 · answered May 25 '17 at 15:38


Try this formula, without look(ahead|behind) at all:

(?:(?:[\\])\*\*(?:.+?)\*\*|(?:[^\\\n]|^)\*\*(.+)\*\*)

Demo


Agnius Vasiliauskas


Jordan Running · Answer 3 · 2017-05-25T16:40:12.967

Consider the following regular expression:

/(.*?)(\\\\|\\\*|\*\*)/g

You can think of this as a tokenizer. It does a non-greedy match of some (or no) text followed by one of the special character sequences \\, \*, or finally **. Matching in this order ensures that weird edge cases like **foo \** bar\\** are handled correctly (<strong>foo ** bar\</strong>). This makes for a very simple String.prototype.replace with a switch in its replacement function. A boolean bold flag helps us decide if ** should be replaced with <strong> or </strong>.

const TOKENIZER = /(.*?)(\\\\|\\\*|\*\*)/g;

function render(str) {
  let bold = false;
  return str.replace(TOKENIZER, (_, text, special) => {
    switch (special) {
      case '\\\\':
        return text + '\\';
      case '\\*':
        return text + '*';
      case '**':
        bold = !bold;
        return text + (bold ? '<strong>' : '</strong>');
      default:
        return text + special;
    }
  });
}

Here I'm assuming that \\ should become \ and \* should become *, as in normal Markdown parsers. It's not dissimilar to Dmitry's solution, but simpler. See it in action in the below snippet:

// Test
const input = document.getElementById('input');
const outputText = document.getElementById('output-text');
const outputHtml = document.getElementById('output-html');

function makeOutput(str) {
  // Render once and reuse the result for both outputs.
  const result = render(str);
  outputText.value = result;
  outputHtml.innerHTML = result;
}

input.addEventListener('input', evt => makeOutput(evt.target.value));
makeOutput(input.value);

body{font-family:'Helvetica Neue',Helvetica,sans-serif}
textarea{display:block;font-family:monospace;width:100%;margin-bottom:1em}
div{padding:2px;background-color:lightgoldenrodyellow}

<label for="input">Input</label>
<textarea id="input" rows="3">aaa **BBB** ccc \**ddd** EEE \\**fff \**ggg** HHH**</textarea>

Output HTML:
<textarea id="output-text" rows="3" disabled></textarea>

Rendered HTML:
<div id="output-html"></div>
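To try the tokenizer outside a browser, here is a small DOM-free sketch. It restates `render` in condensed form (the `if` chain is my shorthand for the answer's `switch`, not the author's code) and runs it on a few inputs.

```javascript
const TOKENIZER = /(.*?)(\\\\|\\\*|\*\*)/g;

// Condensed restatement of the answer's render(); intended to behave the same.
function render(str) {
  let bold = false;
  return str.replace(TOKENIZER, (_, text, special) => {
    if (special === '\\\\') return text + '\\'; // \\ becomes \
    if (special === '\\*') return text + '*';   // \* becomes *
    bold = !bold;                               // special is '**' here
    return text + (bold ? '<strong>' : '</strong>');
  });
}

console.log(render('aaa **BBB** ccc'));             // aaa <strong>BBB</strong> ccc
console.log(render(String.raw`**foo \** bar\\**`)); // <strong>foo ** bar\</strong>
console.log(render(String.raw`\**ddd**`));          // **ddd<strong>
```

Note the last case: since `**` only toggles a flag, a line with an odd number of unescaped `**` markers ends with an unclosed `<strong>`, so this approach assumes the markers are balanced.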
