Regular Expressions - Where Angels Fear to Tread

Question

I've just started studying regular expressions in PHP, but I'm having a terrible time following some of the tutorials on the WWW and cannot seem to find anything addressing my current needs. Perhaps I'm trying to learn too much too fast. This aspect of PHP is entirely new to me.

What I'm trying to create is a regular expression to replace all HTML code in between the nth occurrence of <TAG> and </TAG> with any code I choose.

My ultimate goal is to make an Internet filter in PHP through which I can view a web page stripped of certain content (or replaced with sanitized content) between any specified set of tags <TAG>...</TAG> within the page, where <TAG>...</TAG> represents any valid paired HTML tags, such as <B>...</B> or <SPAN>...</SPAN> or <DIV>...</DIV>, etc, etc.

For example, if the page has a porn ad contained in the 5th <DIV>...</DIV> block within the page, what regular expression could be invoked to target and replace that code with something else, like xxxxxxx, but only the 5th <DIV> block within the page and nothing else?

The entire web page is contained within a single text string and the filtered result should also be a single string of text.

I'm not sure, but I think the code to do this could have a format similar to:

$FilteredPage = preg_replace("REG EXPRESSION", "xxxxxxxx", $OriginalPage);

The "REG EXPRESSION" to invoke is what I need to know and the "xxxxxxxx" represents the text to replace the code between the tags targeted by "REG EXPRESSION".

Regular expressions are obviously the work of Satan!

Any general suggestions or perhaps a couple of working examples which I could study and experiment with would be greatly appreciated.

Thanks, Jay

How can someone suggest a regexp unless you tell us what this "certain content" looks like or what tags it is between? What are these ... in your question? — ayush, Jan 29 '11 at 06:16
Please, for the love of God, do not use regex to parse HTML. It just does not ever work. Instead, use an XML parser. — Rafe Kettler, Jan 29 '11 at 06:21
I don't think you need to know in advance what is between the tags, only how to target the block you wish to replace, such as the 5th occurrence of a given tag. Then whatever text you wish is inserted in place of whatever was there before. — Jay, Jan 29 '11 at 06:27
Related question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Andrew Grimm, Jan 29 '11 at 06:28
Regular expressions work excellently for a variety of tasks; however, parsing (X)HTML is certainly not one of them. — Salman A, Jan 29 '11 at 06:36
Regexes are workable for this task, but only if you are much more proficient with them. Please read http://stackoverflow.com/questions/3650125/how-to-parse-html-with-php and consider the alternatives (phpQuery or QueryPath are often the easiest approach). — mario, Jan 29 '11 at 06:48
@mario, he says he wants to match _any valid pair of tags_ this is categorically impossible to do with just regexes. — tobyodavies, Jan 29 '11 at 07:57
*(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) — Gordon, Jan 29 '11 at 12:05
@tobyodavies: Be more careful with your “categorical impossibilities”: you are mistaken. It is perfectly reasonable to use regexes in a lexer. Parsing is more difficult, but [far from impossible](http://stackoverflow.com/questions/4284176/doubt-in-parsing-data-in-perl-where-am-i-going-wrong/4286326#4286326). — tchrist, Jan 30 '11 at 03:44
@tchrist, yes, as i have learnt from reading your answers... hence the current question i have open, so i can learn what is and is not categorically impossible http://stackoverflow.com/questions/4840988/the-recognizing-power-of-modern-regexes — tobyodavies, Jan 30 '11 at 03:46

score 3 · Answer 1 · edited May 23 '17 at 10:33

This has been done to death, but please, don't use a regex to parse HTML. Just stop, give up... It is not worth the kittens god will kill for you doing it. Use a real HTML or XML parser.

On a more constructive note, look at XPath as a technology better suited to describing the HTML nodes you might want to replace, or at phpQuery and QueryPath.
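For the "5th DIV" case from the question, a minimal sketch of the XPath route could look like the following. It assumes PHP's bundled DOM extension is enabled; `replaceNthElement` is a hypothetical helper name, not something from phpQuery or QueryPath, and `$tag` is assumed to be a plain element name such as `div`.

```php
<?php
// Hypothetical helper: replace the contents of the nth <tag> element
// (1-based, counted in document order) with plain text.
function replaceNthElement(string $html, string $tag, int $n, string $replacement): string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // (//div)[5] is the 5th div in the whole document.
    // //div[5] would instead mean "a div that is the 5th div child of its parent".
    $nodes = $xpath->query(sprintf('(//%s)[%d]', strtolower($tag), $n));
    if ($nodes === false || $nodes->length === 0) {
        return $html; // fewer than $n such elements: leave the page untouched
    }

    $node = $nodes->item(0);
    // Drop everything inside the element, nested tags included.
    while ($node->firstChild) {
        $node->removeChild($node->firstChild);
    }
    $node->appendChild($doc->createTextNode($replacement));

    return $doc->saveHTML();
}
```

Something like `$FilteredPage = replaceNthElement($OriginalPage, 'div', 5, 'xxxxxxx');` would then play the role of the `preg_replace` call in the question. Note that `saveHTML()` returns a normalized document (doctype, `<html>` and `<body>` wrappers, possibly different whitespace), and that `loadHTML()` guesses the character encoding unless the page declares it.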

The reason god kills kittens when you parse HTML with a regex:

HTML is not a regular language, thus a regex can only ever parse a very limited subset of HTML. HTML's nested tag structure is context-free, and as such can only be properly parsed with a context-free parser.
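To see what that means in practice, here is a small sketch using the kind of non-greedy pattern people usually reach for first. The sample markup is made up for illustration.

```php
<?php
// Made-up sample: one div nested inside another.
$html = '<div>outer <div>inner</div> tail</div>';

// The non-greedy group stops at the FIRST </div> it can find,
// which closes the inner div, not the outer one that was opened.
preg_match('#<div>(.+?)</div>#si', $html, $m);

echo $m[0], "\n"; // <div>outer <div>inner</div>
echo $m[1], "\n"; // outer <div>inner
```

The match ends too early and leaves ` tail</div>` behind, so replacing it would produce unbalanced markup. A plain pattern like this has no way to count how many divs are currently open.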

Edit: thank you @Andrew Grimm, this is said much better than I could, as evidenced by the first answer with well over four thousand upvotes!

RegEx match open tags except XHTML self-contained tags

answered Jan 29 '11 at 06:23

tobyodavies

ZOMG! CFG knowledge from that CS course is back to haunt me again. – Olhovsky Jan 29 '11 at 06:41
You might want to read the linked topic. There are much more relevant answers beyond the fun rant if you read it once. It also seems not very relevant since OP did inquire neither about **parsing** nor nested xhtml. Matching and extraction might be entirely sufficient for his use case. – mario Jan 29 '11 at 06:43
Since I've never touched regular expressions before, I needed some guidance on how certain tag pairs could be located and the content between them replaced with something else. From all the stuff I see on the net and the PHP functions manual, they make it sound like regular expressions are the ideal way to target and alter content within a web page. I'm using PHP and have no idea how to use XML yet for such a purpose or even if I can use it on my site. The prevalence of regular expressions in the programs I examined seems to suggest it's widely used, but it looks like Martian to me. – Jay Jan 29 '11 at 06:45
@Jay, that is because most PHP programmers are not very good... Just because most of the lemmings are jumping off a cliff does not make it a good idea. I am ashamed to admit I get paid to write PHP... – tobyodavies Jan 29 '11 at 07:54
@mario, i have read most of the answers to that question before, as a researcher, who has done quite a bit of work on parsing, a little bit of me dies every time i see this kind of question - trying to do something with regexes that is easier and more efficient to do with a more powerful parser... – tobyodavies Jan 29 '11 at 07:56
@mario, while the OP didn't use the _word_ parsing it was clearly stated that he wanted to match any arbitrary _matched pair_ of tags. This simply cannot be achieved with just a regex, period. It is as impossible as building the halting machine... – tobyodavies Jan 29 '11 at 08:00
I hope you realize the irony of linking to simplehtmldom then. While it also uses a character state machine, it's largely based on regular expressions. Another good reason to avoid mentioning it is the clumsy API. Both phpQuery and QueryPath are easier, and an appropriate example makes people more willing to use them than reciting the "regexes are evil" meme. – mario Jan 29 '11 at 08:09
@mario - fair enough, i've not used simpledom much, because i know all the html documents I manipulate are xhtml, but I didn't want to assume this for the OP, will replace with your suggestions. – tobyodavies Jan 29 '11 at 08:15
Modern Regex are not regular and they can parse almost everything, including HTML. But the effort to write a general full blown HTML parser is simply not worth it when there is DOM parsers available. I had an interesting discussion about this on SO, which I have copied to my blog at http://gooh.posterous.com/regular-expressions-are-not-regular – Gordon Jan 29 '11 at 12:05
@Gordon, It is worrying that TC thinks `(.)\1` is not regular - it clearly is for any finite alphabet - it is equivalent to `aa|bb|cc|dd...` back-references do not make regexes any more powerful, just much much shorter. No modern regex feature I have seen makes parsing/recognising CFLs possible – tobyodavies Jan 29 '11 at 12:40
@Gordon, having looked at some indirectly linked stuff, recursive regexes appear, on the surface, powerful enough to parse this, they would be a nightmare to maintain compared to using a real parser... – tobyodavies Jan 29 '11 at 13:02

score 3 · Answer 2 · answered Jan 29 '11 at 06:33

Firstly, are you using the right tool for the job? Regex is a text matching engine, not a fully blown parser - perhaps a dedicated HTML parser will give better results.

Secondly, when approaching any programming problem, try to simplify your problem and build it brick by brick rather than just jumping straight to a final solution. For example, you could:

Start with a simple block of normal English text, and try to match and replace (for example) every occurrence of the word "and".

When that works, wrap it in a loop of PHP that can count up to 5 and only replace the 5th occurrence. Why use regex to count when PHP is so much better at that task?

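Then modify your regex to match your 5th HTML tag (which is a bit harder: < and > themselves are not special characters in PCRE, but the / in a closing tag needs escaping if / is also your pattern delimiter, and real tags often carry attributes)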

By approaching the problem in steps, you will be able to get each part working in turn and build a solid solution that you understand.
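The steps above can be sketched roughly as follows. Instead of an explicit loop, this version keeps the counter in a callback passed to `preg_replace_callback`, so PHP still does the counting and the regex only does the matching. `replaceNthMatch` is a hypothetical helper name, and the sample strings are made up.

```php
<?php
// Hypothetical helper: replace only the nth (1-based) match of $pattern.
function replaceNthMatch(string $pattern, string $replacement, string $subject, int $n): string
{
    $count = 0;
    $result = preg_replace_callback(
        $pattern,
        function (array $m) use (&$count, $n, $replacement): string {
            $count++;
            // Put back the original text for every match except the nth.
            return $count === $n ? $replacement : $m[0];
        },
        $subject
    );
    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $subject;
}

// Steps 1 and 2: plain English text, replace only the 2nd "and".
echo replaceNthMatch('/\band\b/', '&', 'salt and pepper and vinegar and oil', 2), "\n";
// salt and pepper & vinegar and oil

// Step 3: the same helper on simple, non-nested tags. Using # as the
// delimiter means the / in </div> needs no escaping.
$page = '<div>a</div><div>b</div><div>c</div>';
echo replaceNthMatch('#<div>.*?</div>#si', '<div>xxxxxxx</div>', $page, 2), "\n";
// <div>a</div><div>xxxxxxx</div><div>c</div>
```

This only behaves on flat markup. Divs with attributes or divs nested inside other divs need a more careful pattern, or a real parser.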

Thanx all: All of the suggestions are useful (except merely giving up and not using it at all). I do advanced scientific and math programming and I do NOT give up easily. LOL This is just a new aspect I never touched before, so I'm out of my element here. I do know how to replace any substring within a web page. The problem is locating the nth occurrence of a particular starting tag and then replacing all code after it up to the following closing tag. I was looking for a general formula. — Jay, Jan 29 '11 at 07:05

score 0 · Answer 3 · answered Jan 29 '11 at 06:26

ok, few ground rules.

~~Don't post a question like that; pre-ing the whole question will only keep people away~~
Regular expressions are awesome!
If you want to consider other options, look into how to read HTML as an XML document and parse it using XPath
@tobyodavies is pretty much correct, but I'll include the answer in case you want to do it anyway

Now, to your problem. With this one:

$regex = "#<div>(.+?)</div>#si";

You should be ok using that expression and counting the occurrences, much like this:

preg_match_all($regex, $htmlcontent, $matches, PREG_SET_ORDER );

Suppose you only need the 5th one. $matches[$i][0] is the whole text of match number $i (counting from 0), and $matches[$i][1] is the text captured between the tags.

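// $matches is 0-indexed, so the 5th match lives at index 4.
if (count($matches) >= 5)
{
   $myMatch = $matches[4][0];     // the whole <div>...</div> block
   $matchedText = $matches[4][1]; // only the text between the tags
}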
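To go from finding the match to actually replacing it, one possible continuation is sketched below. It assumes the same `$regex` as above, adds the `PREG_OFFSET_CAPTURE` flag so each match also reports where it starts, and uses made-up sample input. The variable `$n` is introduced here for illustration.

```php
<?php
$htmlcontent = '<div>one</div><div>two</div><div>three</div>'; // made-up sample
$regex = '#<div>(.+?)</div>#si';
$n = 2; // which occurrence to replace, 1-based

preg_match_all($regex, $htmlcontent, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

if (count($matches) >= $n) {
    // With PREG_OFFSET_CAPTURE every entry is a [text, byteOffset] pair.
    [$inner, $offset] = $matches[$n - 1][1];
    // Overwrite only the text between the tags of the nth match.
    $filtered = substr_replace($htmlcontent, 'xxxxxxx', $offset, strlen($inner));
} else {
    $filtered = $htmlcontent; // not enough matches: leave the page unchanged
}

echo $filtered, "\n"; // <div>one</div><div>xxxxxxx</div><div>three</div>
```

As with the pattern itself, this only holds up for `<div>` tags without attributes and without other divs nested inside them.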

Good luck in your efforts...

`foo bar` Oh noes! dead kittens! - regexes _cannot parse html_ give up — tobyodavies, Jan 29 '11 at 06:28
You got me wrong, I said that parsing html with regex was a bad idea, but I did give him a jump start, because he said that he was **studying regular expressions in PHP** — David Conde, Jan 29 '11 at 06:36
