
I am trying to parse Markdown content with a regex. To grab bold and italic items from the input, I'm currently using this regex:

/(\*\*)(?<bold>[^**]+)(\*\*)|(?<normal>[^`*[~]+)|\*(?<italic>[^*]+)\*/g

Regex101 Link: https://regex101.com/r/2zOMid/1

The problems with this regex are:
  • if there is a single * inside a bold text, the match breaks
  • if there is a long run like ******* anywhere in between, the match breaks

What I tried: I removed the [^**] part in the bold group, but that messed up the bold match: it ran up to the last ** occurrence and included all the `**` chars in between.

What I want to have:
  • accurate bold
  • * allowed inside bold
  • accurate italics

Language: JavaScript

Assumptions:

  • Bold text is wrapped inside **
  • Italic text is wrapped inside *

Tyzoid
Kiran Parajuli
  • Do not use a single regex here since matches are overlapping. Use bold regex first, then italics. – Wiktor Stribiżew Jul 16 '22 at 09:09
  • Yes, I'm trying to do the same. For that, the bold match in the above regex should be allowed to contain a single `*` char. If I do that, the bold match is messed up. Can I do that properly with regex? – Kiran Parajuli Jul 16 '22 at 11:13
  • By Markdown rules, shouldn't one need to escape an asterisk `*` to show it literally, as in `***\****`, for that exact reason? – Roko C. Buljan Jul 16 '22 at 11:19
  • For me, `*****` and `**\***` mean normal text. If we want just an asterisk in bold, maybe using raw HTML is better (Markdown supports that). But if the input is like `**ab*cd**`, then `ab*cd` should be a match. – Kiran Parajuli Jul 16 '22 at 11:40

5 Answers


[^**] will not avoid two consecutive *. It is a character class that is no different from [^*]. The repeated asterisk has no effect.
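A quick way to confirm this is to run both classes against the same inputs. The sketch below uses made-up sample strings and variable names; it only illustrates that the duplicated asterisk changes nothing.

```javascript
// [^*] and [^**] describe the same set: any single character except "*".
const single = /^[^*]+$/;
const doubled = /^[^**]+$/;

// Hypothetical sample inputs, chosen for illustration only.
const samples = ["abc", "a*b", "a**b", "*"];

// Each entry pairs the result of both patterns for one sample.
const comparison = samples.map(s => [single.test(s), doubled.test(s)]);

// True when the two patterns agree on every sample.
const identical = comparison.every(([a, b]) => a === b);
console.log(identical);
```

Both patterns accept "abc" and reject every sample that contains an asterisk, whether it is single or doubled.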

The pattern for italic should come before the normal part, which should capture anything that remains. That remainder could even be a lone asterisk, for example, so the pattern for normal text should allow it.

It will be easier to use split and use the bold/italic pattern for matching the "delimiter" of the split, while still capturing it. All the rest will then be "normal". The downside of split is that you cannot benefit from named capture groups, but they will just be represented by separate entries in the returned array.
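As a small illustration of how split treats capture groups, the sketch below uses two simplified, hypothetical patterns (not the full regex from this answer). Captured groups are spliced into the result array between the surrounding "normal" pieces.

```javascript
// One capture group: the captured text appears between the normal parts.
const italicOnly = "one *two* three".split(/\*([^*]+)\*/);
console.log(italicOnly); // ["one ", "two", " three"]

// Two capture groups (mark and text): the array cycles through
// normal, mark, formatted text, normal, ...
const withMarks = "a **b** c".split(/(\*\*?)([^*]+)\1/);
console.log(withMarks); // ["a ", "**", "b", " c"]
```

With two groups, every third entry starting at index 0 is normal text, which is the cyclic order a tokeniser can rely on.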

I will ignore the other syntax that markdown can have (like you seem to hint at with [ and ~ in your regex). On the other hand, it is important to deal well with backslash, as it is used to escape an asterisk.

Here is the regular expression (link):

(\*\*?)(?![\s\*])((?:[\s*]*(?:\\[\\*]|[^\\\s*]))+?)\1

Here is a snippet with two functions:

  • a function that first splits the input into tokens, where each token is a pair, like ["normal", " this is normal text "] and ["i", "text in italics"]
  • another function that uses these tokens to generate HTML

The snippet is interactive. Just type the input, and the output will be rendered in HTML using the above sequence.

function tokeniseMarkdown(s) {
    const regex = /(\*\*?)(?![\s\*])((?:[\s*]*(?:\\[\\*]|[^\\\s*]))+?)\1/gs;
    const styles = ["i", "b"];
    // Matches follow this cyclic order: 
    //   normal text, mark (= "*" or "**"), formatted text, normal text, ...
    const types = ["normal", "mark", ""];
    return s.split(regex).map((match, i, matches) =>
        types[i%3] !== "mark" && match &&
            [types[i%3] || styles[matches[i-1].length-1], 
             match.replace(/\\([\\*])/g, "$1")]
    ).filter(Boolean); // Exclude empty matches and marks
}

function tokensToHtml(tokens) {
    const container = document.createElement("span");
    for (const [style, text] of tokens) {
        let node = style === "normal" ? document.createTextNode(text) 
                                      : document.createElement(style);
        node.textContent = text;
        container.appendChild(node);
    }
    return container.innerHTML;
}


// I/O management
document.addEventListener("input", refresh);

function refresh() {
    const s = document.querySelector("textarea").value;
    const tokens = tokeniseMarkdown(s);
    document.querySelector("div").innerHTML = tokensToHtml(tokens);
}
refresh();
/* CSS */
textarea { width: 100%; height: 6em }
div { font: 22px "Times New Roman" }

<!-- HTML -->
<textarea>**fi*rst b** some normal text here **second b**  *first i* normal *second i* normal again</textarea><br>

<div></div>
trincot
  • Thanks @trincot for the comment and the snippet. I'm trying to get this just by regex. At one point in your answer you mentioned the normal part, `which should capture anything that remains.` Can I do that with regex? Like `/oneGroup|anotherGroup|everyThingElse/`. – Kiran Parajuli Jul 16 '22 at 12:47
  • In JavaScript you can only do that if *everyThingElse* is a really smart pattern that captures anything up to the point where the previous groups would start a match. This is not maintainable as you would add more groups and/or more complexity to them. A second option is to let *anything else* just grab **one** character. That means you'll have a lot of single character matches that are consecutive, and then you need to join those again. – trincot Jul 16 '22 at 12:57
  • *"I'm trying to get this just by regex"*: yes, I just added a snippet to illustrate how you can use the result of the regex. The regex itself will identify all bold and italic parts, and also the surrounding marks (`**`), and it will skip normal parts. This is what I understood was the goal of your question -- to grab bold and italic parts. – trincot Jul 16 '22 at 13:00
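The "grab one character and join again" option from the comments could look roughly like the sketch below. The function name, the group names and the simplified bold/italic patterns are assumptions for illustration; escapes such as \* are not handled here.

```javascript
// Sketch: alternation with a one-character fallback, merging consecutive
// fallback matches back into a single "normal" token.
function tokeniseWithFallback(s) {
  const re = /\*\*(?<bold>(?:(?!\*\*).)+)\*\*|\*(?<italic>[^*]+)\*|(?<other>[\s\S])/g;
  const tokens = [];
  for (const m of s.matchAll(re)) {
    const { bold, italic, other } = m.groups;
    const last = tokens[tokens.length - 1];
    if (bold !== undefined) {
      tokens.push(["b", bold]);
    } else if (italic !== undefined) {
      tokens.push(["i", italic]);
    } else if (last && last[0] === "normal") {
      last[1] += other; // join consecutive single-character matches
    } else {
      tokens.push(["normal", other]);
    }
  }
  return tokens;
}

const tokens = tokeniseWithFallback("a **b*c** d *e* f");
console.log(tokens);
// [["normal","a "],["b","b*c"],["normal"," d "],["i","e"],["normal"," f"]]
```

The fallback alternative must come last so that the bold and italic alternatives get the first chance at every position.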

There was some discussion going on in the chat. Just to mention it: there is no requirement yet on how to deal with escaped characters like \*, so I did not handle them.

Depending on the desired outcome I'd pick a two step solution and keep the patterns simple:

str = str.replace(/\*\*(.+?)\*\*(?!\*)/g,'<b>$1</b>').replace(/\*([^*><]+)\*/g,'<i>$1</i>');

Here is the JS-demo at tio.run
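For readers without access to the demo, here is the same two-step replacement wrapped in a function (the name boldThenItalic and the sample input are made up for illustration). Excluding > and < from the italic class appears to be what keeps the second step from matching across the tags inserted by the first step.

```javascript
// Step 1: bold (lazy match, closing ** must not be followed by another *).
// Step 2: italic, which may not span "<" or ">" and so cannot cross
// the <b>...</b> tags produced by step 1.
function boldThenItalic(str) {
  return str
    .replace(/\*\*(.+?)\*\*(?!\*)/g, "<b>$1</b>")
    .replace(/\*([^*><]+)\*/g, "<i>$1</i>");
}

const html = boldThenItalic("**ab*cd** and *it*");
console.log(html); // "<b>ab*cd</b> and <i>it</i>"
```

The single asterisk inside the bold text survives because the text after it runs into a tag before any closing asterisk is found.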

Personally, I don't think it's a good idea to rely on the number of repetitions of the same character to distinguish between kinds of replacement. How it should work is ultimately a matter of taste.

bobble bubble

After looking further into negative lookaheads, I came up with this regex:

/\*\*(?<bold>(?:(?!\*\*).)+)\*\*|`(?<code>[^`]+)`|~~(?<strike>(?:(?!~~).)+)~~|\[(?<linkTitle>[^\]]+)\]\((?<linkHref>.*)\)|(?<normal>[^`[*~]+)|\*(?<italic>[^*]+)\*|(?<tara>[*~]{3,})|(?<sitara>[`[]+)/g

Regex101

This pretty much works for my input scenarios. If someone has a more optimized regex, please comment.
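The part of this regex that allows a single * inside bold is the negative-lookahead construct (?:(?!\*\*).)+, which consumes one character at a time as long as the next two characters are not **. A minimal sketch of that piece in isolation, with a made-up sample string:

```javascript
// Bold only: each "." is consumed only if "**" does not start at that position.
const bold = /\*\*((?:(?!\*\*).)+)\*\*/g;

// Hypothetical input containing a lone asterisk inside the first bold run.
const sample = "**ab*cd** plain **ef**";

const found = [...sample.matchAll(bold)].map(m => m[1]);
console.log(found); // ["ab*cd", "ef"]
```

Because "." does not match line breaks without the s flag, a bold run in this sketch cannot span multiple lines.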

Kiran Parajuli

italic: ((?<!\s)\*(?!\s)(?:(?:[^\*\*]*(?:(?:\*\*[^\*\*]*){2})+?)+?|[^\*\*]+?)(?<!\s)\*)

(?<!\s)\*(?!\s) matches the opening * with no whitespace on either side.
(?:(?:[^\*\*]*(?:(?:\*\*[^\*\*]*){2})+?)+? matches ** an even number of times, which neutralizes meaningless ** inside italic text.
|[^\*\*]+? means: if there is no match for one or more ** pairs, match any characters other than * (the order of this alternation is important).
(?<!\s)\*) matches the closing * with no whitespace before it.
(?: ... ) is a non-capturing group in JavaScript; you can drop the ?: if you do not need it.
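To make the effect of the lookarounds on the opening asterisk concrete, here is a small check with made-up inputs. It tests only the opening-marker fragment, not the full italic pattern.

```javascript
// The opening marker: an asterisk that is neither preceded nor followed by whitespace.
const opening = /(?<!\s)\*(?!\s)/;

const checks = {
  insideWord: opening.test("a*b"),   // non-space characters on both sides
  afterSpace: opening.test("a *b"),  // asterisk preceded by a space
  beforeSpace: opening.test("a* b"), // asterisk followed by a space
  atStart: opening.test("*b"),       // nothing precedes the asterisk
};
console.log(checks);
// { insideWord: true, afterSpace: false, beforeSpace: false, atStart: true }
```

So an italic run that follows a space, as in "normal *italic*", is not matched by the opening fragment as written.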

bold: ((?<!\s)\*\*(?!\s)(?:[^\*]+?|(?:[^\*]*(?:(?:\*[^\*]*){2})+?)+?)(?<!\s)\*\*)

This is similar to the italic pattern, except that the order of the two alternatives (the * pairs and the other characters) is swapped.

Together you can get:
((?<!\s)\*(?!\s)(?:(?:[^\*\*]*(?:(?:\*\*[^\*\*]*){2})+?)+?|[^\*\*]+?)(?<!\s)\*)|((?<!\s)\*\*(?!\s)(?:[^\*]+?|(?:[^\*]*(?:(?:\*[^\*]*){2})+?)+?)(?<!\s)\*\*)

See the result here: https://regex101.com/r/9gTBpj/1

段奕含
  • Tried the regex, no match found. See https://regex101.com/r/EX5cKI/1 – Kiran Parajuli Aug 23 '22 at 13:41
  • (?<!\s)\*(?!\s) means matching the start * with no space around, so if it's around space, it won't match. if you don't want this feature, just try `(\*(?:(?:[^\*\*]*(?:(?:\*\*[^\*\*]*){2})+?)+?|[^\*\*]+?)\*)|(\*\*(?:[^\*]+?|(?:[^\*]*(?:(?:\*[^\*]*){2})+?)+?)\*\*)` – 段奕含 Aug 23 '22 at 16:09
  • view it here https://regex101.com/r/9gTBpj/1 – 段奕含 Aug 23 '22 at 16:17

You can choose the tags depending on the number of asterisks. (1 → italic, 2 → bold, 3 → bold+italic)

function simpleMarkdownTransform(markdown) {
  return markdown
    .replace(/</g, '&lt;') // disallow tags
    .replace(/>/g, '&gt;')
    .replace(
      /(\*{1,3})(.+?)\1(?!\*)/g,
      (match, { length }, text) => {
        if (length !== 2) text = text.italics()
        return length === 1 ? text : text.bold()
      }
    )
    .replace(/\n/g, '<br>') // new line
}

Example:

simpleMarkdownTransform('abcd **bold** efgh *italic* ijkl ***bold-italic*** mnop') 
// "abcd <b>bold</b> efgh <i>italic</i> ijkl <b><i>bold-italic</i></b> mnop"
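String.prototype.italics() and String.prototype.bold() are deprecated legacy methods, so a variant that builds the tags explicitly may be preferable. The sketch below is one such variant under that assumption; the function name simpleMarkdownToHtml is made up, and it also escapes & before the angle brackets.

```javascript
// Same idea as above, with explicit tag strings instead of italics()/bold().
function simpleMarkdownToHtml(markdown) {
  return markdown
    .replace(/&/g, "&amp;") // escape first so later entities are not double-escaped
    .replace(/</g, "&lt;")  // disallow tags
    .replace(/>/g, "&gt;")
    .replace(/(\*{1,3})(.+?)\1(?!\*)/g, (match, stars, text) => {
      // 1 or 3 asterisks: italic; 2 or 3 asterisks: bold
      if (stars.length !== 2) text = `<i>${text}</i>`;
      return stars.length === 1 ? text : `<b>${text}</b>`;
    })
    .replace(/\n/g, "<br>"); // new line
}

const rendered = simpleMarkdownToHtml(
  "abcd **bold** efgh *italic* ijkl ***bold-italic*** mnop"
);
console.log(rendered);
// "abcd <b>bold</b> efgh <i>italic</i> ijkl <b><i>bold-italic</i></b> mnop"
```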
Yukulélé