Regular Expression formatting help required

Question

I am trying to remove a part of a document on the fly using preg_replace().

/* target example:
        <li id="footer-poweredbyico">
        <img src="//bits.wikimedia.org/skins-1.18/common/images/poweredby_mediawiki_88x31.png" alt="Powered by MediaWiki" width="88" height="31" />
        </li>
    */

$reg = preg_quote('<li id="footer-poweredbyico">.*?</li>');

preg_replace($reg,"",$str);

Ignore any errors in the PHP; this question is about how to format the regular expression correctly so that it removes everything from the opening tag to the closing tag of the target example. The contents of the enclosing HTML tags will be different each time, hence .*? (which I think is wrong).

@Robbie assuming he needs no future flexibility, and he is willing to accept the rather rigid constraints this is going to place, a regexp might be the correct tool for this job. There are times, in my opinion, when a full blown HTML parse actually *is* overkill. — Corbin, Mar 29 '12 at 19:27
Now whether you want to confuse parsing and matching... Your regex lacks delimiters, the `/s` modifier, and blindly applying `preg_quote` on required meta chars is the actual mistake here. — mario, Mar 29 '12 at 19:30

score 4 · Answer 1 · answered Mar 29 '12 at 19:29

The preg_quote function actually does the opposite of what you want: its purpose is to disable all regex-features in a string. So in your case, what you currently have is (roughly) looking for an actual .*? in your HTML, instead of looking for zero or more characters. What you want is:

$str = preg_replace('/<li id="footer-poweredbyico">.*?<\/li>/s', '', $str);
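To make the difference concrete, here is a minimal sketch that prints what preg_quote() does to the string from the question and then applies the unquoted pattern. The sample input in $str is made up for illustration.

```php
<?php
// The string from the question, passed through preg_quote() as a whole.
$quoted = preg_quote('<li id="footer-poweredbyico">.*?</li>');

// Every metacharacter now has a backslash in front of it, including . * and ?
echo $quoted, "\n";

// Sample input (made up for this sketch); the <li> spans several lines.
$str = "<ul>\n<li id=\"footer-poweredbyico\">\n<img src=\"x.png\" />\n</li>\n</ul>";

// Unquoted pattern with delimiters and the s modifier, so . also matches newlines.
$clean = preg_replace('/<li id="footer-poweredbyico">.*?<\/li>/s', '', $str);

echo $clean, "\n";
```

Without the s modifier the pattern would not match here, because the target element spans several lines and . does not match a newline by default.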

score 2 · Answer 2 · kijin · 2012-03-29T23:47:22.370

preg_quote() will disable all the special characters you used, like .*?.

Try something like:

preg_replace('#<li id="footer-poweredbyico">.*?</li>#s', '', $str);

Now, the difficult question is whether to make this regex "greedy". Right now, it's ungreedy, which means it will break your page if there's another <li> inside the one you're trying to remove. But if you make it greedy, it will remove everything from the beginning of the <li> tag until the end of the last <li> element in the page, even if it's a different <li> element. Neither is ideal. This is why a proper HTML parser usually does a better job at manipulating HTML.
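The trade-off can be seen on a small made-up input with a nested list inside the target element and an unrelated <li> after it. This is only a sketch; the markup in $html is an assumption, not taken from the question.

```php
<?php
// Made-up input: a nested list inside the target <li>, plus a later, unrelated <li>.
$html = '<ul><li id="footer-poweredbyico"><ul><li>inner</li></ul></li><li>keep me</li></ul>';

// Non-greedy: stops at the FIRST </li>, which belongs to the nested item.
$lazy = preg_replace('#<li id="footer-poweredbyico">.*?</li>#s', '', $html);

// Greedy: runs to the LAST </li> in the input, swallowing the unrelated item.
$greedy = preg_replace('#<li id="footer-poweredbyico">.*</li>#s', '', $html);

echo $lazy, "\n";   // <ul></ul></li><li>keep me</li></ul>  (leftover tags, broken markup)
echo $greedy, "\n"; // <ul></ul>  (the unrelated item is gone too)
```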

But if the page is simple enough, a regex will work.
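For pages that are not simple enough, a parser-based version might look roughly like the sketch below. It assumes PHP's DOM extension is available and that the input is a complete HTML document; the sample markup is made up.

```php
<?php
// Made-up input with a nested list inside the element to remove.
$html = '<html><body><ul>'
      . '<li id="footer-poweredbyico"><ul><li>inner</li></ul></li>'
      . '<li>keep me</li>'
      . '</ul></body></html>';

$doc = new DOMDocument();
// Suppress warnings that libxml emits for imperfect real-world markup.
@$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// Collect the matches into an array first, then remove each one from its parent.
$targets = iterator_to_array($xpath->query('//li[@id="footer-poweredbyico"]'));
foreach ($targets as $li) {
    $li->parentNode->removeChild($li);
}

$result = $doc->saveHTML();
echo $result;
```

Because the parser works on elements rather than on text, the nested <li> is removed together with its parent and the unrelated item is left alone.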

EDIT: Corrected a gross error, thanks to @Nilpo.

edited Mar 29 '12 at 23:47

answered Mar 29 '12 at 19:32

kijin

That "convoluted sequence" **is** what makes it non-greedy. Please don't offer condescending answers if you don't know what you are talking about. – Nilpo Mar 29 '12 at 19:43
@Nilpo Thanks for pointing out my lack of knowledge there, but what makes you think my comment was condescending? What about your own comment? – kijin Mar 29 '12 at 23:04
Sometimes it's hard to tell someone's tone in writing. I apologize if I misread you. – Nilpo Mar 30 '12 at 02:58

score 2 · Answer 3 · answered Mar 29 '12 at 19:33

You don't need to use this hack approach; read the FAQ:

"How can I edit / remove the Powered by MediaWiki image in the footer?"

Not at all correct. I am trying to do something entirely different; the MediaWiki example was just that - an example. – Nick Mar 29 '12 at 19:42
Hiding an element using CSS and removing it completely from the page output are two completely different beasts. – Nilpo Mar 29 '12 at 19:52

score 2 · Accepted Answer · answered Mar 29 '12 at 19:45

The .*? portion of your regex is being escaped, so it only matches a literal .*? instead of acting as a wildcard. Quote the literal tags separately and leave the wildcard unquoted. Try this:

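// Quote only the literal tags; add delimiters and the s modifier so . matches newlines.
$reg = '/' . preg_quote('<li id="footer-poweredbyico">', '/') . '.*?' . preg_quote('</li>', '/') . '/s';

$str = preg_replace($reg, "", $str);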
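The same idea can be wrapped in a small function. This is a sketch only: removeBetween() is a hypothetical helper name, not something from the answer or from PHP itself, and the sample input is made up. It assumes PHP 7 or later.

```php
<?php
/**
 * Hypothetical helper (not from the answer): remove everything from a literal
 * opening string up to the next occurrence of a literal closing string.
 */
function removeBetween(string $open, string $close, string $subject): string
{
    // Quote only the literal parts; the second argument also escapes the delimiter.
    $pattern = '/' . preg_quote($open, '/') . '.*?' . preg_quote($close, '/') . '/s';

    // preg_replace returns null on error; fall back to the unchanged input.
    return preg_replace($pattern, '', $subject) ?? $subject;
}

$str = "<ul>\n<li id=\"footer-poweredbyico\">\n<img src=\"x.png\" />\n</li>\n<li>keep</li>\n</ul>";
$out = removeBetween('<li id="footer-poweredbyico">', '</li>', $str);
echo $out, "\n";
```

Passing '/' as the second argument to preg_quote() matters here because the closing tag contains the delimiter character.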

Nilpo
