Regex problem - when scraping HTML segment

Question

I'm trying to use a regex to scrape the contents between the anchors
"<h2>Highlights</h2>" and "</div><div class="FloatClear"></div><div id="SalesMarquee">" in the HTML segment shown at the end of this question.

But when I tried this regex, it returned nothing:

<h2>Highlights<\/h2>\t?\n?\s?\S?(.*?)<\/div>

I think it may have something to do with the empty spaces within the HTML source...

Can any regex gurus give me an expression that grabs everything between two given HTML anchors, like the ones mentioned above, and that also copes with any whitespace in the HTML source?

BTW I can't use any PHP code as the Regex is for a script I purchased (there is just a text box I have to enter the regex into)...

Many thanks

HTML segment:

<div id="Highlights">

      <h2>Highlights</h2>

      <ul>

<li>1234</li>

<li>abc def asdasd asdasd</li>

<li>asdasda as asdasdasdas </li>

<li>asdasd asdasdas asdsad asdasd asa</li>

</ul>


     </div>

     <div class="FloatClear"></div>

     <div id="SalesMarquee">

      <div id="SalesMarqueeTemplate" style="display: none;">

score 1 · Answer 1 · answered Jan 24 '11 at 06:11

Use an HTML DOM parser such as Simple HTML DOM Parser:

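// Load the library (assumes simple_html_dom.php is in the include path)
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');

// Find all links and print each href
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}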
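
For readers who are not working in PHP, the same parser-based idea can be sketched with Python's standard-library html.parser. This is only an illustration under assumptions: the HTML constant is a shortened copy of the segment from the question, and the HighlightsParser class is a hypothetical helper written for this example, not part of Simple HTML DOM Parser.

```python
from html.parser import HTMLParser

# Shortened copy of the HTML segment from the question (an assumption for this sketch).
HTML = """<div id="Highlights">
  <h2>Highlights</h2>
  <ul>
    <li>1234</li>
    <li>abc def asdasd asdasd</li>
  </ul>
</div>
<div class="FloatClear"></div>
"""


class HighlightsParser(HTMLParser):
    """Hypothetical helper: collects the text of each <li> inside <div id="Highlights">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0    # greater than 0 while inside the Highlights div
        self.in_item = False  # True while inside an <li> of that div
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1  # nested div inside Highlights
            elif dict(attrs).get("id") == "Highlights":
                self.div_depth = 1
        elif tag == "li" and self.div_depth > 0:
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data


parser = HighlightsParser()
parser.feed(HTML)
parser.close()
items = [item.strip() for item in parser.items]
print(items)
```

Because the parser tracks tags rather than characters, whitespace and line breaks between the tags do not affect the result.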

answered Jan 24 '11 at 06:11 by Naveed

score 1 · Answer 2 · edited May 23 '17 at 09:58

I agree with Naveed. Here is a similar post: "Robust and Mature HTML Parser for PHP".

answered Jan 24 '11 at 06:17 by Rob; edited May 23 '17 at 09:58 by Community

score 0 · Answer 3 · answered Jan 24 '11 at 06:30

The following PCRE regex should work:

/<h2>.*<\/h2>(.*)<\/div>/is

The last two characters are modifiers: i for case-insensitive matching and s for dot-all mode. Dot-all mode makes the dot match newlines as well.
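
The effect of dot-all mode can be checked with a small sketch. It uses Python's re module rather than PCRE, on the assumption that re.IGNORECASE and re.DOTALL behave like the i and s modifiers for this pattern; the HTML constant is a shortened copy of the segment from the question.

```python
import re

# Shortened copy of the HTML segment from the question (an assumption for this sketch).
HTML = """<div id="Highlights">

      <h2>Highlights</h2>

      <ul>

<li>1234</li>

<li>abc def asdasd asdasd</li>

</ul>

     </div>

     <div class="FloatClear"></div>

     <div id="SalesMarquee">
"""

PATTERN = r"<h2>.*</h2>(.*)</div>"

# Without DOTALL the dot stops at every newline, so the group cannot span
# the lines between </h2> and </div>, and the search fails.
no_dotall = re.search(PATTERN, HTML, re.IGNORECASE)

# With DOTALL (the counterpart of PCRE's s modifier) the dot also matches newlines.
with_dotall = re.search(PATTERN, HTML, re.IGNORECASE | re.DOTALL)

print(no_dotall)
print(with_dotall is not None)
print("FloatClear" in with_dotall.group(1))
```

Note that the greedy group runs up to the last </div> in the input, so on this sample the capture also swallows the FloatClear div.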

Edit: You'll probably want this regex instead:

/<h2>Highlights<\/h2>(.*)<\/div>.*<div class="FloatClear">/is
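
As a rough check of this second pattern, here is a minimal sketch in Python's re module, assuming re.IGNORECASE and re.DOTALL stand in for the i and s modifiers. The HTML constant is a shortened copy of the segment from the question, and the follow-up findall on <li> is an extra step added for illustration only.

```python
import re

# Shortened copy of the HTML segment from the question (an assumption for this sketch).
HTML = """<div id="Highlights">

      <h2>Highlights</h2>

      <ul>

<li>1234</li>

<li>abc def asdasd asdasd</li>

</ul>

     </div>

     <div class="FloatClear"></div>

     <div id="SalesMarquee">
"""

PATTERN = r'<h2>Highlights</h2>(.*)</div>.*<div class="FloatClear">'

match = re.search(PATTERN, HTML, re.IGNORECASE | re.DOTALL)

# Group 1 is everything between </h2> and the </div> that precedes the
# FloatClear div; strip() drops the surrounding blank lines and indentation.
highlights = match.group(1).strip() if match else None
print(highlights)

# Illustrative extra step: pull the individual list items out of the capture.
items = re.findall(r"<li>(.*?)</li>", highlights or "")
print(items)
```

Requiring the FloatClear div after </div> is what keeps the greedy group from running past the end of the Highlights block.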

answered Jan 24 '11 at 06:30 by hlindset; edited Jan 24 '11 at 06:37

Thanks hlindset, but it doesn't work...I tried it here: http://www.rubular.com/r/nWJQTgYLQ9 – user587064 Jan 24 '11 at 10:59
Rubular.com is for Ruby regexes, and there are some differences. For example, in Ruby you'd need to end the regex with /im instead of /is to get the dot to match newlines, like this: http://www.rubular.com/r/48jKU6y74T – hlindset Jan 24 '11 at 23:21

score 0 · Answer 4 · answered Jan 24 '11 at 16:10

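Try adding an 'm' modifier (for 'multiline') to the regex provided by hlindset. This matters on Rubular, which uses Ruby's engine, where 'm' is what lets the dot match newlines; in PCRE that is the job of 's', and 'm' only changes where ^ and $ match: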

/<h2>Highlights<\/h2>(.*)<\/div>.*<div class="FloatClear">/ism

Here it is in action:

http://www.rubular.com/r/td1IUBvg26

Documentation on all modifiers is available by googling "pcre pattern modifiers".
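
To make the difference between the modifiers concrete, here is a small sketch in Python, whose flags follow the PCRE meanings: re.MULTILINE corresponds to PCRE's m and re.DOTALL to PCRE's s (which is what Ruby calls m). The TEXT sample is a made-up three-line snippet, not the full segment from the question.

```python
import re

# Made-up three-line sample (an assumption for this sketch).
TEXT = "<h2>Highlights</h2>\n<ul><li>1234</li></ul>\n</div>"
PATTERN = r"<h2>Highlights</h2>(.*)</div>"

# MULTILINE only changes where ^ and $ match; the dot still stops at
# newlines, so the pattern cannot reach </div> on the last line.
multiline_only = re.search(PATTERN, TEXT, re.MULTILINE)

# DOTALL lets the dot match newlines, so the group spans all three lines.
dotall = re.search(PATTERN, TEXT, re.DOTALL)

# What MULTILINE does change: ^ matches at the start of every line
# instead of only at the start of the string.
line_starts_default = re.findall(r"^<", TEXT)
line_starts_multiline = re.findall(r"^<", TEXT, re.MULTILINE)

print(multiline_only)
print(repr(dotall.group(1)))
print(len(line_starts_default), len(line_starts_multiline))
```

So when a pattern that works on Rubular with m is moved to a PCRE-based tool, it is the s modifier that needs to be present.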
