I'm trying to use a regex to scrape the content between the anchors
"<h2>Highlights</h2>" and "</div><div class="FloatClear"></div><div id="SalesMarquee">" within the HTML segment at the end of this post.

But when I try this regex, it returns nothing:

<h2>Highlights<\/h2>\t?\n?\s?\S?(.*?)<\/div>

I think it may have something to do with the whitespace (blank lines and indentation) in the HTML source...

Can any regex gurus give me an expression for grabbing everything between any two given HTML anchors, like the ones mentioned above, that also copes with whitespace in the HTML source?

BTW I can't use any PHP code as the Regex is for a script I purchased (there is just a text box I have to enter the regex into)...

Many thanks

HTML segment:

<div id="Highlights">

      <h2>Highlights</h2>

      <ul>

<li>1234</li>

<li>abc def asdasd asdasd</li>

<li>asdasda as asdasdasdas </li>

<li>asdasd asdasdas asdsad asdasd asa</li>

</ul>


     </div>

     <div class="FloatClear"></div>

     <div id="SalesMarquee">

      <div id="SalesMarqueeTemplate" style="display: none;">
user587064

4 Answers

Use an HTML DOM parser such as Simple HTML DOM Parser:

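// Load the library (assumes simple_html_dom.php is on the include path)
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');

// Find all links and print each href
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}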
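
For comparison, here is a minimal sketch of the same parser-based idea applied to the markup in the question, written in Python with only the standard library's html.parser. The class name HighlightsParser and the sample string are made up for illustration; the sketch assumes the list items of interest sit inside <div id="Highlights">, as in the posted segment.

```python
from html.parser import HTMLParser


class HighlightsParser(HTMLParser):
    """Collects the text of each <li> inside <div id="Highlights">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # nesting depth inside the target div (0 = outside)
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth:
                self.div_depth += 1          # nested div inside the target
            elif dict(attrs).get("id") == "Highlights":
                self.div_depth = 1           # entering the target div
        elif tag == "li" and self.div_depth:
            self.in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data


# Hypothetical sample, shaped like the segment in the question.
sample = (
    '<div id="Highlights">\n  <h2>Highlights</h2>\n  <ul>\n'
    '<li>1234</li>\n<li>abc def</li>\n</ul>\n</div>\n'
    '<div class="FloatClear"></div>\n'
    '<div id="SalesMarquee"><ul><li>other</li></ul></div>'
)

parser = HighlightsParser()
parser.feed(sample)
highlights = [item.strip() for item in parser.items]
print(highlights)
```

Because the parser tracks tags rather than characters, blank lines and indentation in the source do not affect the result.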
Naveed
I agree with Naveed. Here is a similar post: Robust and Mature HTML Parser for PHP

Rob
The following PCRE regex should work.

/<h2>.*<\/h2>(.*)<\/div>/is

The last two characters are modifiers: i for case-insensitive matching and s for dot-all mode. Dot-all mode makes the dot match newlines as well, which is what lets the pattern span the blank lines in the HTML source.

Edit: You'll probably want this regex instead:

/<h2>Highlights<\/h2>(.*)<\/div>.*<div class="FloatClear">/is
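
A rough way to see the effect of dot-all mode is the sketch below. It uses Python's re module rather than PCRE, on the assumption that re.S and re.I behave like the s and i modifiers for this pattern. The sample string is a shortened, made-up copy of the question's segment, and the slashes are left unescaped because Python patterns have no / delimiters.

```python
import re

# Shortened sample modelled on the segment in the question (whitespace kept).
html = """<div id="Highlights">

      <h2>Highlights</h2>

      <ul>

<li>1234</li>

<li>abc def</li>

</ul>


     </div>

     <div class="FloatClear"></div>

     <div id="SalesMarquee">
"""

pattern = r'<h2>Highlights</h2>(.*)</div>.*<div class="FloatClear">'

# Without dot-all, "." stops at every newline, so nothing matches.
no_dotall = re.search(pattern, html, re.I)

# re.S is Python's dot-all flag; re.I is case-insensitive matching.
with_dotall = re.search(pattern, html, re.I | re.S)

# Pull the individual list items out of the captured block.
items = re.findall(r'<li>(.*?)</li>', with_dotall.group(1), re.S) if with_dotall else []

print(no_dotall)
print(items)
```

The captured group still includes the surrounding whitespace and the <ul> tags, so some trimming is needed if only the list text is wanted.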
hlindset
  • Thanks hlindset, but it doesn't work...I tried it here: http://www.rubular.com/r/nWJQTgYLQ9 – user587064 Jan 24 '11 at 10:59
  • Rubular.com is for Ruby regexes, and there are some differences. For example you'd need to end it with /im instead of /is to get the dot to match newlines, like this: http://www.rubular.com/r/48jKU6y74T – hlindset Jan 24 '11 at 23:21
Try adding an 'm' modifier (for 'multiline') to the regexes provided by hlindset:

/<h2>Highlights<\/h2>(.*)<\/div>.*<div class="FloatClear">/ism

Here it is in action:

Documentation on all modifiers is available by googling "pcre pattern modifiers".
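
One caveat worth checking against that documentation: in PCRE, as in Python's re module, the multiline modifier changes only where ^ and $ match, not what the dot matches. The sketch below uses Python's re.M, re.S and re.I as assumed stand-ins for m, s and i, on a made-up sample string, and suggests the extra m leaves this particular pattern's result unchanged.

```python
import re

# Made-up sample string, shaped like the segment in the question.
html = (
    '<h2>Highlights</h2>\n<ul>\n<li>1234</li>\n</ul>\n</div>\n'
    '<div class="FloatClear"></div>'
)
pattern = r'<h2>Highlights</h2>(.*)</div>.*<div class="FloatClear">'

with_s = re.search(pattern, html, re.I | re.S)           # like /is
with_sm = re.search(pattern, html, re.I | re.S | re.M)   # like /ism
only_m = re.search(pattern, html, re.I | re.M)           # like /im in PCRE

# Multiline mode matters only for patterns that use ^ or $.
line_starts_plain = re.findall(r'^<li>', html)
line_starts_multi = re.findall(r'^<li>', html, re.M)

print(with_s.group(1) == with_sm.group(1))
print(only_m)
print(line_starts_plain, line_starts_multi)
```

Ruby is the odd one out here: as noted in the comments above, its /m is what makes the dot match newlines.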

Steven