Regex "replace until"

Question

This is a bit more complicated (to me!) than similar questions.

I'm trying to copy some dates with Regex in some old HTML, into another location, but have a problem with the replacements extending outside the required block, when I repeat the search and replace.

In the example below, each <ul> .. </ul> block represents a specific date, which is included in its first <li>. I want to copy "13 Oct 2005" into the subsequent <li> items of the first block, and likewise "14 Oct 2005" into the subsequent <li> items of the second block, but never outside the date's own <ul> .. </ul> block.

This I can do using the following Regex (used by Funduc's Search & Replace utility for Windows, apparently "a subset of UNIX grep notation"):

Search: <li>+[0-9] *<b>*[]<li><b>
Replace: <li>%1 %2\<b>%3\<li>%1 %2\<b>

+[0-9] is one or more numeric; * is any alphabetical; *[] is anything; %1 .. %4 are replacement positions.

Here is my original HTML

<ul>
<li>13 Oct 2005<b>Title One</b>
Some text
<p>
<li><b>Title Two</b>
Some more text
<p>

</ul>

<ul>
<li>14 Oct 2005<b>Title 3</b>
Another line of text
<p>
<li><b>Title 4</b>
Yet another line of text
<p>
<li><b>Title 5</b>
Some text
<p>

</ul>

After my first script run, this correctly gives me:

<ul>
<li>13 Oct 2005<b>Title One</b>
Some text
<p>
<li>13 Oct 2005<b>Title Two</b>
Some more text
<p>

</ul>

<ul>
<li>14 Oct 2005<b>Title 3</b>
Another line of text
<p>
<li>14 Oct 2005<b>Title 4</b>
Yet another line of text
<p>
<li><b>Title 5</b>
Some text
<p>

</ul>

But after my second script run, the 13 Oct 2005 is added incorrectly to the next <ul> .. </ul> block:

<ul>
<li>13 Oct 2005<b>Title One</b>
Some text
<p>
<li>13 Oct 2005<b>Title Two</b>
Some more text
<p>

</ul>

<ul>
<li>14 Oct 2005<b>Title 3</b>
Another line of text
<p>
<li>14 Oct 2005<b>Title 4</b>
Yet another line of text
<p>
<li>13 Oct 2005<b>Title 5</b>    <-- wrong !!!
Some text
<p>

</ul>

I have about 20,000 <ul> .. </ul> blocks (hence the script), and each block contains between 1 and 10 <li> tags with titles. I assumed it couldn't be done in one pass.

Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — arco444, Jan 16 '17 at 14:09
I don't know what this regex syntax is but it makes no sense for grep. — melpomene, Jan 16 '17 at 14:13
I think it is a bit more complicated than the similar questions, and I've added some comments to the syntax above. — iantresman, Jan 16 '17 at 15:13

score 0 · Answer 1 · edited May 23 '17 at 10:29

tl;dr Regular expressions are the wrong tool here; use something stronger.

Please, no

Parsing HTML with regex is a bad idea. It's even in the SO regex reference: Reference - What does this regex mean?

So don't do it.

Yes, I see you're still here

If you're completely in charge of your HTML files, and you don't want a general-purpose tool or anything, I can't stop you from working with regexes.

Still, no

As you've noticed, a single regular expression is the wrong tool here. Crucially, regular expressions, even powerful ones with backreferences and lookahead, don't have extra memory of what they're not looking at.

But that's exactly what you're asking for here! You want to know if you've left the <ul> block, while looking only at the dates and tags that delimit what you're searching for.
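To make that concrete, here is a rough Python re-creation of the two runs from the question. The translation of the Funduc pattern into Python's re syntax is my own guess (in particular, that *[] behaves like a lazy .*? that crosses newlines), and the names SEARCH, REPLACE and run_once are just for illustration.

```python
import re

# Assumed translation of the question's Funduc pattern into Python's `re`:
#   +[0-9] -> (\d+)        the day number
#   *      -> ([^<\n]*)    the rest of the date, up to the next tag
#   *[]    -> (.*?)        anything, newlines included (laziness is a guess)
SEARCH = re.compile(r"<li>(\d+) ([^<\n]*)<b>(.*?)<li><b>", re.DOTALL)
REPLACE = r"<li>\1 \2<b>\3<li>\1 \2<b>"

# The original HTML from the question.
HTML = """<ul>
<li>13 Oct 2005<b>Title One</b>
Some text
<p>
<li><b>Title Two</b>
Some more text
<p>

</ul>

<ul>
<li>14 Oct 2005<b>Title 3</b>
Another line of text
<p>
<li><b>Title 4</b>
Yet another line of text
<p>
<li><b>Title 5</b>
Some text
<p>

</ul>
"""


def run_once(text: str) -> str:
    """One search-and-replace pass over the whole file."""
    return SEARCH.sub(REPLACE, text)


first = run_once(HTML)     # Title Two and Title 4 get their dates
second = run_once(first)   # Title 5 gets the date of the *first* block
print(second)
```

On the second run, the only <li><b> left in the file is the one before Title 5, so the "anything" part of the pattern stretches straight across the first </ul> to reach it. Nothing in the pattern knows that a block ended in between.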

What you need is a programming language.

A style point

The regex syntax you use is nothing like unix-style regex syntax. That's forgivable.

What's unforgivable (to any regex enthusiast) is using multi-line regexes without explaining why it's necessary.

Aha! But wait! Now that we've concluded we need a programming language, multi-line regexes are no longer necessary!

So, let's stop using them.

Why are we still here?

At this point, I'm here for self-flagellation: I've concluded that, on one hand, we need a programming language, because we need something stronger than regexes; on the other hand, I won't use an HTML parser because I swore to myself that I would use only regexes.

I am clearly an idiot, and you shouldn't listen to anything I have to say.

Perverse solution

We can fix the file in one pass, once we allow ourselves to use a programming language. We just need to save a little state: the date in the current <ul> block.

Here's a Perl script which fits the terrible goal I set for myself (I'm using 5.22):

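#! /usr/bin/perl

use strict;
use warnings;

# A dated title line, e.g. <li>13 Oct 2005<b>...; captures the date.
my $date_re = qr/^<li>(\d+ [[:alpha:]]+ \d+)<b>/;
# A title line with no date, e.g. <li><b>Title Two</b>; captures the <b> element.
my $non_date_title_re = qr/^<li>(<b>.*<\/b>)$/;
# The date of the <ul> block we're currently in ('' if we haven't seen one).
my $local_date = '';

# Read the files named on the command line (or standard input), line by line.
while (<>) {
    if (/<\/ul>/) {
        # Leaving the block: forget its date.
        $local_date = '';
    } elsif (/$date_re/) {
        $local_date = $1;
    } elsif (/$non_date_title_re/) {
        # Put the saved date right after the <li>.
        s/$non_date_title_re/<li>$local_date$1/;
    }

    print;
}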

You may not read Perl, but what's going on here is pretty clear: first, save some regexes in local variables, for clarity: one for dates between <li> and <b>, and one for titles with no dates.

For each line in the file, if it contains </ul>, invalidate the local date we've saved. (For the file in your question, this part is not strictly necessary.) If instead the line matches the date regex, save the date for later. If instead the line matches a non-date title, use substitution to put the date we saved right after the <li>.
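If Perl really is unreadable to you, the same steps can be sketched in Python. This is a loose port rather than a drop-in replacement: the function name copy_dates is made up, [A-Za-z] stands in for Perl's [[:alpha:]], and the line ending is split off by hand so that CRLF files are tolerated as well.

```python
import re

# A dated title line, e.g. <li>13 Oct 2005<b>...; group 1 is the date.
DATE_RE = re.compile(r"^<li>(\d+ [A-Za-z]+ \d+)<b>")
# A title line with no date, e.g. <li><b>Title Two</b>; group 1 is the <b> element.
NON_DATE_TITLE_RE = re.compile(r"^<li>(<b>.*</b>)$")


def copy_dates(lines):
    """Yield every line, with the current block's date added to undated titles."""
    local_date = ""  # the only state we carry from line to line
    for line in lines:
        body = line.rstrip("\r\n")
        ending = line[len(body):]  # keep the original line ending as it was
        if "</ul>" in body:
            local_date = ""  # leaving the block: forget its date
        elif (dated := DATE_RE.match(body)):
            local_date = dated.group(1)
        elif (undated := NON_DATE_TITLE_RE.match(body)):
            body = "<li>" + local_date + undated.group(1)
        yield body + ending
```

Since copy_dates accepts any iterable of lines, an open file object can be passed in directly and the results written out as they are produced.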

But really, use an HTML parser.
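For what that might look like, here is a minimal sketch using html.parser from Python's standard library. It is only one way to go about it and rests on assumptions: the class name DateCopier and the helper copy_dates are invented for this example, a date is taken to be text of the form "13 Oct 2005" directly after an <li>, and the markup is written back out event by event, so end tags and entities may come out slightly normalized (lower-cased tag names, a trailing semicolon on entities).

```python
import re
from html.parser import HTMLParser

# Assumed date shape, as in the question: "13 Oct 2005".
DATE_RE = re.compile(r"\d+ [A-Za-z]+ \d+")


class DateCopier(HTMLParser):
    """Re-emit the HTML, copying each <ul> block's date into undated <li> items."""

    def __init__(self):
        # convert_charrefs=False: entities are reported separately, so they
        # can be written back instead of being decoded into the text.
        super().__init__(convert_charrefs=False)
        self.out = []          # pieces of the rewritten document
        self.date = ""         # date seen in the current <ul> block, if any
        self.after_li = False  # True while only whitespace has followed an <li>

    def handle_starttag(self, tag, attrs):
        if tag == "b" and self.after_li and self.date:
            self.out.append(self.date)  # <li> went straight to <b>: add the date
        self.after_li = tag == "li"
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.after_li = False
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "ul":
            self.date = ""  # the parser tells us when the block ends
        self.after_li = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if self.after_li and DATE_RE.fullmatch(text):
            self.date = text
        if text:
            self.after_li = False
        self.out.append(data)

    def handle_entityref(self, name):
        self.after_li = False
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.after_li = False
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def copy_dates(html: str) -> str:
    """Return the document with dates copied into the undated items."""
    parser = DateCopier()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The state is the same as in the Perl script (the date of the current block), but the parser decides what counts as a tag and where the <ul> ends, so the result no longer depends on every title sitting on a line of its own.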

@iantresman, it works until it doesn't. I tried to explain in my answer that even if you don't mind using regexes in this HTML file, it still isn't a strong enough tool for your problem. If you want to incorporate regexes in your solution, I showed you how you can do that. — JXG, Jan 18 '17 at 08:46