I need to target the opening tag of the last top-level LI in a list that may or may not contain sublists in various positions, without using CSS or JavaScript.

Is there a simple, elegant regexp that can help with this? I'm no guru with them, but whether the middle text needs a greedy or non-greedy match, (.*) / (.+), seems to change as nested lists are added and moved around in the list, and this is throwing me off.

$pattern = '/^(<ul>.*)<li>(.+<\/li><\/ul>)$/';
$replacement = '$1<li id="lastLi">$2';

Perhaps there is an easier approach, such as converting to XML to target the LI and then converting back?

For example: Single Element

<ul>
    <li>TARGET</li>
</ul>

Multiple Elements

<ul>
    <li>foo</li>
    <li>TARGET</li>
</ul>

Nested Lists before end

<ul>
    <li>
        foo
        <ul>
            <li>bar</li>
        </ul>
    </li>
    <li>TARGET</li>
</ul>

Nested List at end

<ul>
    <li>foo</li>
    <li>
        TARGET
        <ul>
            <li>bar</li>
        </ul>
    </li>
</ul>
veilig

3 Answers

6

You should never use regex to parse HTML, especially in this particular case (recursive tags).

The main reason is that HTML is not a regular language, so it can't be parsed 100% correctly with regex.

Beyond that, even regex-parsing HTML "well enough" is complicated enough that you are more likely than not to end up with bugs in your code.
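
To make the failure concrete, here is a small sketch of how greedy and non-greedy variants of the question's pattern behave on the "Nested List at end" example. The `s` modifier, the `$2` back-reference, and the variable names are assumptions added for this illustration, not part of the original attempt.

```php
<?php
// Sample input: the "Nested List at end" case, whitespace removed for brevity.
$nestedAtEnd = '<ul><li>foo</li><li>TARGET<ul><li>bar</li></ul></li></ul>';
$replacement = '$1<li id="lastLi">$2';

// Greedy prefix: backtracks only as far as the LAST "<li>" in the string,
// which belongs to the nested list, not to the top-level one.
$greedy = preg_replace('/^(<ul>.*)<li>(.+<\/li><\/ul>)$/s', $replacement, $nestedAtEnd);

// Non-greedy prefix: stops at the FIRST "<li>" instead.
$lazy = preg_replace('/^(<ul>.*?)<li>(.+<\/li><\/ul>)$/s', $replacement, $nestedAtEnd);

echo $greedy, PHP_EOL; // id ends up on the nested "bar" item
echo $lazy, PHP_EOL;   // id ends up on the first item, "foo"
```

Neither quantifier can count nesting depth, so no choice of greedy versus non-greedy picks the right tag for every list shape.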

Instead, use a dedicated HTML parser.
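
As a rough sketch of what that could look like in PHP, the snippet below uses the bundled DOM extension (`DOMDocument` and `DOMXPath`). The function name `markLastTopLevelLi`, the `wrapper` div, and the default id are hypothetical choices for this example, and it assumes the fragment contains a single top-level `<ul>` with ASCII content.

```php
<?php
// Hypothetical helper: adds an id to the last top-level <li> of the first <ul>.
function markLastTopLevelLi(string $html, string $id = 'lastLi'): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so there is one known container to read back from.
    $doc->loadHTML('<div id="wrapper">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // The child axis ("/") only sees direct children, so <li> elements
    // inside nested lists are never candidates.
    $nodes = $xpath->query('//div[@id="wrapper"]/ul[1]/li[last()]');
    if ($nodes->length > 0) {
        $nodes->item(0)->setAttribute('id', $id);
    }

    // Serialize only the wrapper's children, not the <html>/<body>
    // scaffolding that loadHTML adds around the fragment.
    $wrapper = $xpath->query('//div[@id="wrapper"]')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo markLastTopLevelLi('<ul><li>foo</li><li>TARGET<ul><li>bar</li></ul></li></ul>');
```

The serializer may normalize whitespace, so the output is equivalent markup rather than a byte-for-byte copy of the input with one attribute added.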

DVK
  • +1, and even more so, in this very case it would be especially difficult to do with regex. Regex doesn't work well with recursive structures. (And no, the "recursive regex" stuff some regex engines offer isn't very nice to use.) – Matti Virkkunen Jun 07 '10 at 20:26
  • +1, this is extraordinarily difficult if you want the top level recursion. And I want to know who downvoted this, because in this case it is entirely correct. It isn't ALWAYS the case that you shouldn't use regex to parse HTML, but here it definitely is. – Platinum Azure Jun 07 '10 at 20:28
  • While that link is a good (as in humorous) read, it doesn't tell the OP much as to "why" s/he shouldn't do such a thing. I find such answers (posting only a link to "the html+regex thread") to be of the same type as a LMGTFY link: not the spirit of SO. Hence my down vote. – Bart Kiers Jun 07 '10 at 20:30
  • @Bart - fair enough, though Way Harsh Man :) I added more details to the answer. – DVK Jun 07 '10 at 20:33
  • @DVK: note that I first down voted and then typed up the reason for it. But I am of course free to only down vote you without providing a reason! :). Anyway, I removed it since you expanded your answer as to why one shouldn't do it. – Bart Kiers Jun 07 '10 at 20:34
  • @Bart - why Thank you! The Spirit of SO lives! :) – DVK Jun 07 '10 at 20:46
  • I'm getting pretty tired of The Rant too, but I'm afraid if people stop invoking it, they'll go back to repeating the "now you have two problems" quote, and that's even more annoying. :) – Alan Moore Jun 07 '10 at 20:50
  • @Alan - So they will be asking how to parse HTML with regex, AND quoting the two problems quote - and then you'll have TWO problems. *DVK ducks to avoid something heavy thrown at him* – DVK Jun 07 '10 at 22:08
1

Use an HTML parser, not a regex.

Stuart
1

Converting to XML and parsing the DOM is the easiest way, provided you are confident enough about the kind of HTML that has to be processed.
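
A minimal sketch of that XML route in PHP, assuming the input is well-formed markup whose root element is the `<ul>`. The function name `markLastLiViaXml` and the choice to return `null` for input it cannot handle are hypothetical, made only for this illustration.

```php
<?php
// Hypothetical helper: strict XML route. Returns null when the markup is
// not well-formed XML or does not have <ul> as its root element.
function markLastLiViaXml(string $xhtml, string $id = 'lastLi'): ?string
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xhtml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$ok || $doc->documentElement->nodeName !== 'ul') {
        return null;
    }

    // Walk only the direct children of the root <ul>; nested lists are
    // deeper in the tree and are never visited here.
    $last = null;
    foreach ($doc->documentElement->childNodes as $child) {
        if ($child instanceof DOMElement && $child->nodeName === 'li') {
            $last = $child;
        }
    }
    if ($last === null) {
        return null;
    }

    $last->setAttribute('id', $id);
    // Passing the node omits the XML declaration from the output.
    return $doc->saveXML($doc->documentElement);
}

echo markLastLiViaXml('<ul><li>foo</li><li>TARGET</li></ul>');
```

The strictness cuts both ways: an unclosed tag makes the XML parser reject the whole input, which is why this route depends on knowing what kind of HTML will arrive.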

Ville Laitila