Regex to parse wrongly formatted unordered list

Question

I'm dealing with a website migration. Unfortunately, the unordered lists on the old website are written without the ul tag, so I would like to convert the following incorrect markup into standard HTML ul markup:

<p class="bodytext">
 •&nbsp;&nbsp;&nbsp; This is some random text.<br>
 •&nbsp;&nbsp;&nbsp; This is some other random text.<br>
 •&nbsp;&nbsp;&nbsp; This is another random text.
</p>

Important facts:

We are in the context of a post element, so there are a lot of bodytext classes
The last list element has no br tag
All list elements start with this "bull" character followed by 3x "&nbsp;"
The amount of list elements is variable

I thought about a regex, but I have no idea how to tackle the problems mentioned above, especially how to "detect" where to replace and how to match the last list item without the br tag.

Any help would be appreciated.

Possible duplicate of [Using regular expressions to parse HTML: why not?](https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not) — Justinas, Sep 20 '18 at 07:09
Lots of manual work you have here. 1. Replace all `
` with `
`. Use PHPStorm to auto-edit ending tag. 2. Replace all `dot ` to `
` and append `
` to end of line (again use editor with multi-edit feature) — Justinas, Sep 20 '18 at 07:11
The posted question does not appear to include [any attempt](https://idownvotedbecau.se/noattempt/) at all to solve the problem. StackOverflow expects you to [try to solve your own problem first](https://meta.stackoverflow.com/questions/261592/how-much-research-effort-is-expected-of-stack-overflow-users), as your attempts help us to better understand what you want. Please edit the question to show what you've tried, so as to illustrate a specific problem you're having in a [MCVE]. For more information, please see [ask] and take the [tour]. — CertainPerformance, Sep 20 '18 at 07:18

score 2 · Answer 1 · answered Sep 20 '18 at 07:44

As stated in the comments, parsing HTML with regular expressions is a bad idea.

If you understand this, and still want to continue using regexp, you can do something like this:

1. Inserting the <ul></ul> tags:

Regexp (with the "dot matches newline" flag enabled, because the paragraph spans several lines):

(<p class="bodytext">)(.+?)(<\/p>)

replace with:

<ul>\2</ul>

Gives

<ul>
    •&nbsp;&nbsp;&nbsp; This is some random text.<br>
    •&nbsp;&nbsp;&nbsp; This is some other random text.<br>
    •&nbsp;&nbsp;&nbsp; This is another random text.
</ul>
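The first step can be sketched in Python as follows. This is only an illustration, assuming Python's `re` module is acceptable for the migration; the names `WRAP_RE` and `wrap_in_ul` are made up for this example. The `re.DOTALL` flag is what lets `(.+?)` span the line breaks inside the paragraph.

```python
import re

# Sample input taken from the question.
html = (
    '<p class="bodytext">\n'
    ' •&nbsp;&nbsp;&nbsp; This is some random text.<br>\n'
    ' •&nbsp;&nbsp;&nbsp; This is some other random text.<br>\n'
    ' •&nbsp;&nbsp;&nbsp; This is another random text.\n'
    '</p>'
)

# re.DOTALL makes "." match newlines too, so the lazy (.+?) can cover
# the whole multi-line paragraph body up to the first closing </p>.
WRAP_RE = re.compile(r'(<p class="bodytext">)(.+?)(</p>)', re.DOTALL)


def wrap_in_ul(markup: str) -> str:
    """Replace the <p class="bodytext"> wrapper with <ul>...</ul> (hypothetical helper)."""
    # \2 is the second capturing group, i.e. the paragraph body.
    return WRAP_RE.sub(r'<ul>\2</ul>', markup)


step1 = wrap_in_ul(html)
print(step1)
```

Note that this pattern has no check for the bullet character, so it rewrites every bodytext paragraph it finds, including ones that are not lists.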


2. Inserting the <li></li> tags

Regexp:

(•&nbsp;&nbsp;&nbsp; )(.+?)(<br>|)(\n)

Replace with:

<li>\2</li>\n

Gives:

<ul>
    <li>This is some random text.</li>
    <li>This is some other random text.</li>
    <li>This is another random text.</li>
</ul>
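A minimal Python sketch of the second step, again assuming the `re` module; `ITEM_RE` and `bullets_to_li` are names invented for this example. The empty alternative in `(<br>|)` is what allows the last item, which has no br tag, to match as well.

```python
import re

# Output of step 1, as shown above.
step1 = (
    "<ul>\n"
    "    •&nbsp;&nbsp;&nbsp; This is some random text.<br>\n"
    "    •&nbsp;&nbsp;&nbsp; This is some other random text.<br>\n"
    "    •&nbsp;&nbsp;&nbsp; This is another random text.\n"
    "</ul>"
)

# Group 1: the bullet prefix, group 2: the item text,
# group 3: an optional <br>, group 4: the line break ending the item.
ITEM_RE = re.compile(r'(•&nbsp;&nbsp;&nbsp; )(.+?)(<br>|)(\n)')


def bullets_to_li(markup: str) -> str:
    """Turn each bullet line into an <li> element (hypothetical helper)."""
    # Only the item text (\2) is kept; prefix and <br> are dropped.
    return ITEM_RE.sub(r'<li>\2</li>\n', markup)


step2 = bullets_to_li(step1)
print(step2)
```

The pattern relies on every item, including the last one, being followed by a line break, which holds for the markup in the question.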


Do you have a suggestion for how to solve that without regex? — EmWe, Sep 20 '18 at 08:02

Michał Turczyn · Answer 2 · 2018-09-20T07:48:26.693

You could do it in two steps:

Use pattern: <([^ ?]+).*>((?=[^<]*•&nbsp;&nbsp;&nbsp;)[\w\W]+)<\/(\1)>

<([^ ?]+).*> and <\/(\1)> ensure that you will have matching tags (opening and closing), thanks to the backreference to the first capturing group: \1.

It will match only elements that contain the list, thanks to the positive lookahead: (?=[^<]*•&nbsp;&nbsp;&nbsp;).


In the second capturing group, you'll have all list elements, so you can replace the match with: <ul>\2</ul>. Now you'll have something like this:

<ul>
  •&nbsp;&nbsp;&nbsp; This is some random text.<br>
  •&nbsp;&nbsp;&nbsp; This is some other random text.<br>
  •&nbsp;&nbsp;&nbsp; This is another random text.
</ul>
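To see the lookahead and the backreference at work, here is a small Python sketch. It assumes the pattern is applied to one element at a time and that the source contains literal `&nbsp;` entities; `LIST_ELEMENT_RE` and `wrap_if_list` are hypothetical names. Because `[\w\W]+` is greedy, running the same pattern over a whole document could swallow several elements in one match, so treat this as an illustration rather than a drop-in solution.

```python
import re

# The pattern from this answer; "/" needs no escaping in Python.
# (?=[^<]*•&nbsp;&nbsp;&nbsp;) only succeeds if a bullet follows the
# opening tag before any other tag starts, and </(\1)> requires the
# closing tag name to equal the opening tag name captured in group 1.
LIST_ELEMENT_RE = re.compile(
    r'<([^ ?]+).*>((?=[^<]*•&nbsp;&nbsp;&nbsp;)[\w\W]+)</(\1)>'
)

with_bullets = (
    '<p class="bodytext">\n'
    ' •&nbsp;&nbsp;&nbsp; This is some random text.<br>\n'
    ' •&nbsp;&nbsp;&nbsp; This is another random text.\n'
    '</p>'
)
without_bullets = '<p class="bodytext">Just an ordinary paragraph.</p>'


def wrap_if_list(element: str) -> str:
    """Swap the wrapper for <ul> only when the element holds a bullet list."""
    return LIST_ELEMENT_RE.sub(r'<ul>\2</ul>', element)


print(wrap_if_list(with_bullets))     # wrapper replaced by <ul>...</ul>
print(wrap_if_list(without_bullets))  # left untouched
```

Unlike a pattern tied to `<p class="bodytext">`, this one leaves ordinary paragraphs alone, which matters when a post contains many bodytext elements.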

Replace all occurrences of •&nbsp;&nbsp;&nbsp; (including the trailing space) with <li>
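This last replacement needs no regex at all. A rough Python sketch, where `BULLET`, `bullets_to_open_li` and `drop_br` are names invented here: the first helper does exactly the replacement described above, while the second one, which removes the leftover br tags, is an extra cleanup that is not part of this answer.

```python
# The literal bullet prefix as it appears in the source markup.
BULLET = '•&nbsp;&nbsp;&nbsp; '

items = (
    "<ul>\n"
    "  •&nbsp;&nbsp;&nbsp; This is some random text.<br>\n"
    "  •&nbsp;&nbsp;&nbsp; This is another random text.\n"
    "</ul>"
)


def bullets_to_open_li(markup: str) -> str:
    """Replace every bullet prefix with an opening <li> tag."""
    return markup.replace(BULLET, '<li>')


def drop_br(markup: str) -> str:
    """Assumption, not in the answer: remove the now redundant <br> tags."""
    return markup.replace('<br>', '')


result = drop_br(bullets_to_open_li(items))
print(result)
```

The items are left without closing li tags. HTML permits omitting them, but if the target system expects explicit closing tags, a regex such as the one in the other answer is needed instead.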
