
Here's my problem. I am scraping a website for data, and would like to use regex to get the contents of three similar divs. They are structured like this:

    <div id="cphMain_pnlBreakfastItems" class="bp2-wdn-col-one-third">
        <h4>blah blah</h4>
        <span>content</span>
        <span>other content</span>
    </div>
    <div id="cphMain_pnlLunchItems" class="bp2-wdn-col-one-third">
        <h4>blah blah</h4>
        <span>content</span>
        <span>other content</span>
    </div>
    <div id="cphMain_pnlDinnerItems" class="bp2-wdn-col-one-third">
        <h4>blah blah</h4>
        <span>content</span>
        <span>other content</span>
    </div>

There are three separate divs: Breakfast, Lunch, and Dinner Items. I am trying to use preg_match to get each of them as its own match, like this:

    preg_match('/<div id="cphMain_pnl.*Items"[\s\S]*\/div>/s', $page, $match);

However, after running this, I get all three divs as one match instead of three separate matches. How can I get them as three separate matches?

I tried using DOM to do this, but when I got the contents of the divs, it had stripped the tags so I didn't know what content is what.

Brobin
  • @Dai: No, please don't close as a duplicate of that question. It is not helpful to anyone who is looking for an answer to this question. It is just a rant, and not really suitable as a duplicate target. See [this post](http://meta.stackoverflow.com/a/252393/1438393) for reasons why. – Amal Murali Jul 01 '14 at 21:18
  • @Dai There aren't really any good answers there. I just need a quick solution for my case, not a circlejerk about how bad regex is. I stated my problem as getting one result instead of 3, not even the same question as the one you linked. – Brobin Jul 01 '14 at 21:25

1 Answer

Your pattern uses greedy quantifiers, but lazy matching is what you want here. With three divs one after the other, .* consumes as many characters as it can, so the Items" it ends up matching belongs to DinnerItems rather than BreakfastItems. The greedy [\s\S]* likewise runs on to the last /div>, so the result is a single match spanning all three divs.

To make a quantifier lazy, add a ? after it. Also, since you are already using the s flag, you can use . instead of [\s\S]:

    preg_match_all('~<div id="cphMain_pnl.*?Items".*?</div>~s', $page, $match);
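
To see the difference concretely, here is a minimal sketch that runs the greedy pattern from the question and the lazy pattern side by side. The `$page` content is an assumed, shortened stand-in for the real page, and the variable names are only illustrative.

```php
<?php
// Assumed sample markup, mirroring the structure shown in the question.
$page = <<<HTML
<div id="cphMain_pnlBreakfastItems" class="bp2-wdn-col-one-third">
    <h4>blah blah</h4>
    <span>content</span>
</div>
<div id="cphMain_pnlLunchItems" class="bp2-wdn-col-one-third">
    <h4>blah blah</h4>
    <span>content</span>
</div>
<div id="cphMain_pnlDinnerItems" class="bp2-wdn-col-one-third">
    <h4>blah blah</h4>
    <span>content</span>
</div>
HTML;

// Greedy: .* and [\s\S]* take as much as possible, so one match covers all three divs.
$greedyCount = preg_match_all('/<div id="cphMain_pnl.*Items"[\s\S]*\/div>/s', $page, $greedy);

// Lazy: .*? expands only as far as needed for the rest of the pattern to match.
$lazyCount = preg_match_all('~<div id="cphMain_pnl.*?Items".*?</div>~s', $page, $lazy);

echo "greedy: $greedyCount, lazy: $lazyCount\n"; // greedy: 1, lazy: 3
```

Note that the lazy pattern stops at the first </div> it finds, so it only works as intended when the target divs contain no nested divs, as in the markup above.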

Also, you need to use preg_match_all to get all matches; preg_match returns only the first one.
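
As a rough illustration of the difference between the two functions, the sketch below runs both on the same input. The capture group around the meal name and the `PREG_SET_ORDER` flag are additions of mine, not part of the pattern above; they are one possible way to keep track of which div is which. The sample `$page` and the `$byMeal` array are hypothetical.

```php
<?php
// Assumed sample input; the span contents are placeholders.
$page = '<div id="cphMain_pnlBreakfastItems"><span>eggs</span></div>'
      . '<div id="cphMain_pnlLunchItems"><span>soup</span></div>'
      . '<div id="cphMain_pnlDinnerItems"><span>pasta</span></div>';

// Same lazy pattern, with a capture group (my addition) around the meal name.
$pattern = '~<div id="cphMain_pnl(.*?)Items".*?</div>~s';

// preg_match stops after the first match.
preg_match($pattern, $page, $first);

// preg_match_all collects every match; PREG_SET_ORDER yields one array per match.
preg_match_all($pattern, $page, $sets, PREG_SET_ORDER);

// Index each div's full markup by its captured meal name.
$byMeal = [];
foreach ($sets as $set) {
    $byMeal[$set[1]] = $set[0];
}

print_r(array_keys($byMeal));
```

With the default flag (`PREG_PATTERN_ORDER`), the full matches would instead all be in `$sets[0]` and the captured names in `$sets[1]`.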

I also used a different delimiter (~ instead of /) so that the slash in </div> does not have to be escaped.
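
A small sketch of the delimiter point, using a made-up one-div input: the two calls below should behave identically, and only the escaping differs.

```php
<?php
// Hypothetical minimal input for illustration.
$html = '<div>a</div>';

// With / as the delimiter, the slash in </div> must be escaped.
preg_match('/<div>.*?<\/div>/s', $html, $a);

// With ~ as the delimiter, the same pattern needs no escaping.
preg_match('~<div>.*?</div>~s', $html, $b);

var_dump($a[0] === $b[0]);
```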

Jerry