I'm trying to match and replace broken HTML using a regex, but I've gone in circles a couple of times with grouping, lookarounds and quantifiers, and I'm still struggling to match every scenario.

This is in JavaScript, because the issue is triggered in an HTML editor running in the browser.

The broken HTML is specific: any text between a closing LI and the list's closing UL or OL, i.e. text that is not properly wrapped in a list item.

For instance, this piece here, from the greater example underneath:

    </li>
        bbb<strong>bbbb</strong><strong>bbb&nbsp;&nbsp;&nbsp; <span style="text-decoration: underline;"><em>bbbbb</em></span></strong>=0==
</ul>

Here is the full example of where the issue could exist:

<ul>
    <li>1111</li>
    <li>Could be anything here</li>
    <li>aaaa</li>
        bbb<strong>bbbb</strong><strong>bbb&nbsp;&nbsp;&nbsp; <span style="text-decoration: underline;"><em>bbbbb</em></span></strong>=0==
</ul>
<ol>
    <li>more?<li>
    <li>echo</li>
</ol>

This is what I intend the HTML to look like using a match + replace.

<ul>
    <li>1111</li>
    <li>Could be anything here</li>
    <li>aaaabbb<strong>bbbb</strong><strong>bbb&nbsp;&nbsp;&nbsp; <span style="text-decoration: underline;"><em>bbbbb</em></span></strong>=0==
</ul>
<ol>
    <li>more?<li>
    <li>echo</li>
</ol>
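
For illustration, here is a minimal sketch of a match + replace that would produce the output above for this particular sample. It is not a general HTML repair: it assumes the stray text never contains a nested list, and that the merged item should end with a closing `</li>`. The names `mergeStrayText` and `pattern` are made up for this example.

```javascript
// Hypothetical helper: folds stray text that sits between a closing </li>
// and the list's closing </ul> or </ol> back into the preceding list item.
function mergeStrayText(html) {
  // </li>, then any run of characters that does not begin an
  // <li>, </li>, <ul>, </ul>, <ol> or </ol> tag, then the closing list tag.
  // [\s\S] is used instead of . so that the run can span line breaks.
  const pattern = /<\/li>((?:(?!<\/?(?:li|ul|ol)\b)[\s\S])*)(<\/[ou]l>)/gi;

  return html.replace(pattern, (match, stray, closeList) => {
    const text = stray.trim();
    // Only whitespace between </li> and the list end: already well formed.
    if (text === "") return match;
    // Drop the early </li> and re-add it after the stray text.
    return text + "</li>\n" + closeList;
  });
}
```

Calling `mergeStrayText(editorHtml)` (where `editorHtml` is whatever string the editor hands you) leaves well-formed lists untouched, since the callback returns the original match when only whitespace sits between `</li>` and the list end. Note that it does not preserve any indentation in front of the closing list tag.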

A few expressions I've tried are below. Depending on which one I use (or slight variations of them), I either match too much or don't match what I expect:

/<\/li>.*?<\/[ou]l>/mig
/<\/li>([\s\n]*[\w!\.?;,<:>&\\\-\{\}\[\]\(\)~#'"=/]+[\s\n]*)+<\/[ou]l>/mig
/<\/li>([\s\n]*[^\s\n]+[\s\n]*)+<\/[ou]l>/i
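
As a small illustration of the "too much or not at all" behaviour with the first expression: in JavaScript `.` does not match line breaks, and the `m` flag only changes what `^` and `$` mean, so the pattern cannot cross lines. Swapping `.` for `[\s\S]` lets it cross lines, but then the match starts at the first `</li>` in the list rather than the last one. The variable names and the shortened `sample` string are made up for this sketch.

```javascript
// Shortened stand-in for the broken list in the question.
const sample = "<ul>\n<li>1111</li>\n<li>aaaa</li>\nbbb\n</ul>";

// First attempt: "." stops at every line break, so nothing matches.
const dotPattern = /<\/li>.*?<\/[ou]l>/gim;
const dotMatch = sample.match(dotPattern);

// Same idea with [\s\S]: it can span lines, but the lazy quantifier still
// starts from the FIRST </li>, so a well-formed item is swallowed as well.
const anyCharPattern = /<\/li>[\s\S]*?<\/[ou]l>/gi;
const anyCharMatch = sample.match(anyCharPattern);

console.log(dotMatch);     // null
console.log(anyCharMatch); // [ '</li>\n<li>aaaa</li>\nbbb\n</ul>' ]
```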

I've searched on and off for a couple of days with no luck. I realise I'm probably asking something that has been answered hundreds of times before.

phil
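  • I left the closing tag off the "This is what I intend the HTML to look like" bit. The last list item should have the closing `</li>`, i.e. `<li>aaaabbb<strong>bbbb</strong><strong>bbb&nbsp;&nbsp;&nbsp; <span style="text-decoration: underline;"><em>bbbbb</em></span></strong>=0==</li>` – phil Dec 03 '13 at 08:18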
  • argh! do NOT use **regex** to parse **HTML** bro, use an **HTML parser**. Well, this might be a different case as you are trying to repair **broken HTML**, but does the same issue really occur often enough to justify a **regex**? It also seems like you want to remove the closing tag `</li>`, which can be done in a simple text editor with `ctrl+f` and replace; that doesn't require any patterns and shouldn't mess up your HTML. – Tafari Dec 03 '13 at 08:25
  • Thanks, that's what I needed to hear (argh! included). – phil Dec 03 '13 at 08:48
  • hah glad it helped : P – Tafari Dec 03 '13 at 09:53