0

A friend of mine is having a problem with regular expressions. He basically has this HTML code:

<a>I don't want this</a>
startString
test1
<a>I want this1</a>
test2
<a>I want this2</a>
endString
gibberish
<a>I don't want this</a>
startString
test1
<a>I want this3</a>
test2
<a>I want this4</a>
endString
gibberish
<a>I don't want this</a>

Like I wrote in the headline, he currently uses 2 regexes to get the "I want this" strings in the code above:

(?<=startString).+?(?=endString)
<a>(.+?)</a>
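
For reference, here is a minimal sketch of that two-step approach. The question does not name a language, so Python's `re` module is an assumption here, as are the variable names `block_re`, `link_re` and `wanted`. The first regex isolates each startString ... endString block, and the second one runs only inside those blocks.

```python
import re

html = """<a>I don't want this</a>
startString
test1
<a>I want this1</a>
test2
<a>I want this2</a>
endString
gibberish
<a>I don't want this</a>
startString
test1
<a>I want this3</a>
test2
<a>I want this4</a>
endString
gibberish
<a>I don't want this</a>
"""

# Pass 1: isolate each startString ... endString block.
# re.DOTALL lets "." match newlines, so a block can span several lines.
block_re = re.compile(r"(?<=startString).+?(?=endString)", re.DOTALL)

# Pass 2: pull the link text out of a single block.
link_re = re.compile(r"<a>(.+?)</a>")

wanted = [
    text
    for block in block_re.findall(html)
    for text in link_re.findall(block)
]
print(wanted)
# ['I want this1', 'I want this2', 'I want this3', 'I want this4']
```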

He now wants to combine these two into a single regex that does the same thing. Could anybody explain whether this is possible and, if it is, how to do it?

Thank you!

Andy Lester
brdigi
  • **Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php or [this SO thread](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester Sep 07 '13 at 17:27
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags - read the top answer – d'alar'cop Sep 07 '13 at 17:30
  • If you are using Ruby then Nokogiri is enough for this purpose. – Arup Rakshit Sep 07 '13 at 17:30
  • Thanks for the input guys. He'll take a look at it. :) – brdigi Sep 07 '13 at 17:33

2 Answers

0

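A pattern like this would work in single-line mode (where `.` also matches newlines), provided the engine supports variable-length lookbehind, as .NET does: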

(?<=startString.*)<a>(.+?)</a>(?=.*endString)
p.s.w.g
  • This is almost correct. See: https://dl.dropboxusercontent.com/u/50869443/temp/fastrichtig.PNG It does match one "I don't want this", but matches all the other correct ones. Still, thanks, getting closer! However, what exactly do you mean by "single-line" mode? – brdigi Sep 07 '13 at 17:38
0

The short answer is that your friend's two regexes can be combined into a single regex only in engines that keep a collection of captures for each group. .NET is the one I can think of.

Examining your friend's expressions:

 (?<=startString).+?(?=endString)

This gets the first start/end pair and everything in between, including unbalanced starts. It should have been 'startString(.+?)endString', but the result is the same. If he wanted mutually exclusive pairs, it would have been 'startString( (?:(?!startString).)+? )endString'. So you can see he relaxed the expression to allow multiple starts before the first single end.
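
To make the difference concrete, here is a small sketch in Python (my choice of language, not the asker's). The input string `text` is a made-up example with one unbalanced start, and the second pattern is written without the free-spacing blanks so it works without an extended-mode flag.

```python
import re

# Hypothetical input: two starts, only one end.
text = "startString A startString B endString"

# Relaxed form: lazy dot, happily runs across a second startString.
lazy = re.compile(r"startString(.+?)endString", re.DOTALL)

# Mutually exclusive form: the negative lookahead stops the dot
# from ever stepping onto another startString.
tempered = re.compile(r"startString((?:(?!startString).)+?)endString", re.DOTALL)

lazy_hit = lazy.search(text).group(1)
tempered_hit = tempered.search(text).group(1)

print(repr(lazy_hit))      # ' A startString B '
print(repr(tempered_hit))  # ' B '
```

The relaxed pattern pairs the first start with the first end, while the second one only pairs the end with the nearest start before it.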

That alone precludes @Jerry's approach from working.

 <a>(.+?)</a>

This next expression, as a standalone, will return 1 match. It can't be used, for instance, like this '(?:(.+?))+' and be expected to accumulate an array of capture buffer 1 values. It returns 1 match, with capture buffer 1 containing only the last iteration's match. That is, unless the language supports capture collections (i.e., .NET).
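
A quick way to see the "last match only" behaviour, sketched in Python as an assumption (its `re` module has no capture collections). The repeated pattern below, with the `<a>` tags wrapped in the quantified group, is my own illustration and not necessarily the exact expression meant above.

```python
import re

block = "<a>I want this1</a><a>I want this2</a>"

# One match that repeats the capturing group twice.
repeated = re.compile(r"(?:<a>(.+?)</a>)+")
m = repeated.search(block)

whole = m.group(0)      # the entire run of both links
last_only = m.group(1)  # only the final iteration survives in group 1

# Without capture collections, the usual workaround is a second,
# unquantified pass that yields one match per link.
all_links = re.findall(r"<a>(.+?)</a>", block)

print(last_only)  # I want this2
print(all_links)  # ['I want this1', 'I want this2']
```

In .NET the same repeated group would expose every iteration through the group's capture collection, which is what makes the single-regex version possible there.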

In the case of collections, these two are easily combined into a single expression.

In summary, having been gone for a while and now being back, it still surprises me how readily uninformed answers get accepted around here.