
I'm trying to use regex to turn Markdown unordered lists into HTML. Below is an example input; the result should be two separate ul elements, the first with three li elements and the second with two.

- Item 1
- Item 2
- Item 3

A second list:
- Item 1
- Item 2

I'm running into an infuriating issue where the following regex doesn't work as intended. It doesn't seem to recognise the \n character: the first regex, /((- |ー).*(\n|$))+/g, only seems to match when there is an end of string ($).

.replaceAll(/((- |ー).*(\n|$))+/g, function(match) {
    return `<ul>${match}</ul>`.replaceAll(/(- |ー).*/g, function(match) {
        return `<li>${match.match(/(?<=(- |ー)).*/)}</li>`
    });
});

I don't understand what the problem is; I tested the expression in Regexr, where it works perfectly.

Here is the full context if it would be helpful:

parse(markdown) {
    return markdown

    // Clean HTML brackets
        .replaceAll('<', '&lt;')
        .replaceAll('>', '&gt;')

    // Change markdown links into html links
        .replaceAll(/\[.*?\]\(.*?\)/g, function (match) {
            return `<a href='${match.match(/(?<=\().*?(?=\))/)[0]}' target='_blank'>${match.match(/(?<=\[).*?(?=\])/)[0]}</a>`;
        })

    // Headings
        .replaceAll(/(^|\n)# .*/g, function (match) {
            return `<h1>${match.match(/(?<=# ).*/)}</h1>`
        })
        .replaceAll(/(^|\n)## .*/g, function (match) {
            return `<h2>${match.match(/(?<=# ).*/)}</h2>`
        })
        .replaceAll(/(^|\n)### .*/g, function (match) {
            return `<h3>${match.match(/(?<=# ).*/)}</h3>`
        })

    // Unordered lists
        .replaceAll(/((- |ー).*($|\n))+/g, function(match) {
            return `<ul>${match}</ul>`.replaceAll(/(- |ー).*/g, function(match) {
                return `<li>${match.match(/(?<=(- |ー)).*/)}</li>`
            });
        });
}

Note that the \n char is recognised perfectly fine in the // Headings section.

(Edit to clarify that this is in VueJS, hence using this method definition syntax in a component's methods object)

Piturnah
  • "Infuriating" is the right word. This approach is not likely to bear fruit. See: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Tom Jul 05 '21 at 22:25
  • There are libraries for parsing markdown. Search npm. Writing your own is a great way to waste a lot of time getting nowhere. – Tom Jul 05 '21 at 22:26
  • @Tom is right; I'm not sure why you're not using a markdown library. I'm only working on this because I'm taking a break and treating this like a puzzle. Anyway, why do you say "since the first regex /((- |ー).*(\n|$))+/g seems to be only getting matches when there is end of string"? I just ran your updated example input through your unordered list code. It picks up two lists end-to-end, one per match. But you have other issues in your logic, because it produces wonky HTML. – Inigo Jul 05 '21 at 23:22

2 Answers


It's not ignoring the \n at all. Your pattern matches the entire list as a single match: \n ends each line inside that match, and \n|$ ends the last one. In other words, you are getting one long match, not three separate matches, one for each list item, as you want.

In fact, you were mistaken about "in Regexr where it works perfectly." Go try it there again. You get one long match, not three.
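To check this outside Regexr, here is a minimal sketch that runs the original pattern over the question's example input. The variable names are mine, not from the question. Each run of consecutive list lines should come back as a single match, newlines included.

```javascript
// Example input from the question (variable names are illustrative only).
const input = '- Item 1\n- Item 2\n- Item 3\n\nA second list:\n- Item 1\n- Item 2';
const greedy = /((- |ー).*(\n|$))+/g;

// matchAll yields one entry per match; m[0] is the full matched text.
const greedyMatches = Array.from(input.matchAll(greedy), (m) => m[0]);

console.log(greedyMatches.length);          // 2: one match per list
console.log(JSON.stringify(greedyMatches)); // the \n characters are inside the matches
```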

The reason for this is that regex quantifiers are greedy by default. You can change that by appending ? to the quantifier to make it lazy instead of greedy:

/((- |ー).*(\n|$))+?/g

Try it with and without the ? in Regexr so that you can see the difference, and also so that you can learn how to interpret Regexr results since you missed this last time.
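The same comparison can be sketched in plain JavaScript, again with the question's example input and illustrative variable names. With the lazy +? the outer group stops after one repetition, so each list line becomes its own match.

```javascript
// Example input from the question (variable names are illustrative only).
const input = '- Item 1\n- Item 2\n- Item 3\n\nA second list:\n- Item 1\n- Item 2';
const lazy = /((- |ー).*(\n|$))+?/g;

// The lazy quantifier takes as few repetitions as possible: one line per match.
const lazyMatches = Array.from(input.matchAll(lazy), (m) => m[0]);

console.log(lazyMatches.length);          // 5: one match per list item
console.log(JSON.stringify(lazyMatches));
```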

ℹ️ This doesn't fix your list item conversion to HTML; you have other problems in your code, but I'm answering the question you are asking.
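For what it's worth, two things seem to go wrong in the conversion itself. First, when a list ends at the end of the input, the inner (- |ー).* also swallows the closing </ul>. Second, match.match(...) returns an array, and because the lookbehind contains a capture group, interpolating that array yields text like "Item 1,- ". Below is a rough sketch of one way to keep one ul per list. The convertLists helper is hypothetical, not the asker's code, and it handles only flat lists that start with "- " or "ー".

```javascript
// Hypothetical helper: wraps each run of consecutive list lines in its own <ul>.
function convertLists(text) {
  // ^ with the m flag anchors every repetition at the start of a line,
  // so the whole run of list lines is captured as one block.
  return text.replace(/(?:^(?:- |ー).*(?:\n|$))+/gm, (block) => {
    const items = block
      .split('\n')
      .filter((line) => line !== '')
      // Strip the bullet marker and wrap the remaining text.
      .map((line) => `<li>${line.replace(/^(?:- |ー)/, '')}</li>`)
      .join('');
    // Keep the newline that ended the block, if there was one.
    const trailing = block.endsWith('\n') ? '\n' : '';
    return `<ul>${items}</ul>${trailing}`;
  });
}

const sample = '- Item 1\n- Item 2\n- Item 3\n\nA second list:\n- Item 1\n- Item 2';
const html = convertLists(sample);
console.log(html);
```

Building the li elements from the matched block, rather than running a second regex over the already-wrapped string, keeps the </ul> out of reach of the item pattern.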

There is another approach you can take that yields the same results:

/((- |ー).*($))+/gm

This approach switches to multiline mode, in which ^ and $ also match at the start and end of each line rather than only at the start and end of the whole input. The \n characters are still in the string, but you no longer need to match them explicitly; you just match the end of each line with $.
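A small sketch of the multiline variant, using the question's example input and illustrative variable names, shows both points: each list line is matched on its own, and the newlines stay in the input but are left out of the matches.

```javascript
// Example input from the question (variable names are illustrative only).
const input = '- Item 1\n- Item 2\n- Item 3\n\nA second list:\n- Item 1\n- Item 2';
const multiline = /((- |ー).*($))+/gm;

// With the m flag, $ matches just before each \n as well as at the end of input.
const lineMatches = Array.from(input.matchAll(multiline), (m) => m[0]);

console.log(lineMatches.length);                            // 5: one match per list item
console.log(lineMatches.some((s) => s.includes('\n')));     // false: no newline in any match
console.log(input.includes('\n'));                          // true: newlines are still in the input
```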

Inigo
  • Hi, thank you very much for your detailed response. However, it seems I didn't make myself clear enough originally - if there is a single list, I want to get one long match because I want to put each list in the markdown in its own `ul` element. In the subsequent regular expressions I am getting the individual list items and putting them as `li` elements inside the aforementioned `ul` element. Hopefully that clears up the confusion. – Piturnah Jul 05 '21 at 23:02
  • OK, got it. Let me take another look. Can you make your question clearer? This is important for future people who find this in SO searches. – Inigo Jul 05 '21 at 23:04
  • @Piturnah And I guarantee you that Regexr works the same as JavaScript, because that's what it's using behind the scenes. – Inigo Jul 05 '21 at 23:08
  • Thank you, I have updated the question accordingly. – Piturnah Jul 05 '21 at 23:11
  • My name is Mulan! I did it to save my father. – Inigo Jul 07 '21 at 17:57

You are overcomplicating your code. Try this:

1- Apply all filters other than the unordered list one

2- Find all list matches

3- Map them to li:

parse(markdown) {
    const cleansedString = markdown

    // Clean HTML brackets
        .replaceAll('<', '&lt;')
        .replaceAll('>', '&gt;')

    // Change markdown links into html links
        .replaceAll(/\[.*?\]\(.*?\)/g, function (match) {
            return `<a href='${match.match(/(?<=\().*?(?=\))/)[0]}' target='_blank'>${match.match(/(?<=\[).*?(?=\])/)[0]}</a>`;
        })

    // Headings
        .replaceAll(/(^|\n)# .*/g, function (match) {
            return `<h1>${match.match(/(?<=# ).*/)}</h1>`
        })
        .replaceAll(/(^|\n)## .*/g, function (match) {
            return `<h2>${match.match(/(?<=# ).*/)}</h2>`
        })
        .replaceAll(/(^|\n)### .*/g, function (match) {
            return `<h3>${match.match(/(?<=# ).*/)}</h3>`
        });

    const listMatches = Array.from(cleansedString.matchAll(/((- |ー)(.*)($|\n))/g));
    const listHtml = listMatches.map((matches) => `<li>${matches[3]}</li>`);

    return `<ul>${listHtml.join('')}</ul>`;
}
Not A Bot
  • This may or may not work (I haven't examined the code) and I commend your helpfulness, but on SO you should answer only the question asked. The question was about why their regex wasn't matching `\n`. People in the future will find this question looking for an answer to that; not for how to better parse markdown and convert it into HTML – Inigo Jul 05 '21 at 23:03