
Using regex and PHP I am trying to get the content of the title attribute as below.

preg_match('/<abbr class="dtstart" title="([^"]*)"/i', $file_string, $starts);
$starts_out = $starts[1];

preg_match('/<abbr class="dtend" title="([^"]*)"/i', $file_string, $ends);
$ends_out = $ends[1];

Here is the exact part of the markup that I want to match; for this part I get the data correctly.

<div id="eventDetailInfo">
    <h2>When</h2>
    <div class="p">
        <div>From:
            <abbr class="dtstart" title="2012-08-24T17:00:00">Friday, August 24th, 2012</abbr></div>
        <div>Until:
            <abbr class="dtend" title="2012-08-26">Saturday, August 25th, 2012</abbr></div>
    </div>
</div>

However, some articles have no Until element, and in that case the regex matches the first occurrence in the rest of the page (the related articles).

My question is: how do I restrict the regex to match only the block above, so that if no

<div>Until:
    <abbr class="dtend" title="2012-08-26">Saturday, August 25th, 2012</abbr></div>

is found, the result stays blank?

This is the rest of the page's code; unfortunately, the regex matches it.

<div class="evdate">
    <em>When:</em>
    <abbr class="dtstart" title="2012-07-03T21:00:00">July 3rd</abbr>
    to
    <abbr class="dtend" title="2012-07-13">July 12th</abbr>*
</div>
<div class="evtime"><em>Time:
    </em>
    21:00
</div>
</div>
EnexoOnoma
    Using [regex to parse HTML is a bad idea](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) - you should try using an XML parser or something similar – ernie Jul 31 '12 at 17:01

2 Answers


Whilst I agree with the others about not using regex to match HTML, personally I find regex extremely helpful when you know exactly what you need to get. Unless you're scraping loads of different sources, you don't often need the consistency a DOM framework would give you.

Anywho, given your question I don't think DOM will necessarily help you; you'll still need to design it to only pick up from within certain classes/patterns. The way to do this is to expand your regex to match not just the value you want but also some of the containing content, so include something unique in the pattern so that it won't match the related article (the same as you would need to do with the DOM, albeit a little easier!).
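
As a rough sketch of that idea (not tested against the real page): in the posted markup, the detail block puts a `From:` or `Until:` label inside a `<div>` right before the `<abbr>`, whereas the related-articles block uses `<em>When:</em>` and `to`. Assuming that holds for every article, the label can serve as the unique anchor. The helper name `extractEventDate` and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper: returns the title attribute of the <abbr> with the
// given class, but only when it directly follows the given label
// ("From:" or "Until:") inside a <div>. Returns '' when there is no match.
function extractEventDate(string $html, string $label, string $class): string
{
    $pattern = '/<div>\s*' . preg_quote($label, '/')
             . '\s*<abbr class="' . preg_quote($class, '/')
             . '" title="([^"]*)"/i';

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : '';
}

// Sample page: the detail block has no "Until:" row, and a related
// article follows it (markup shortened from the question).
$file_string = <<<'HTML'
<div id="eventDetailInfo">
    <h2>When</h2>
    <div class="p">
        <div>From:
            <abbr class="dtstart" title="2012-08-24T17:00:00">Friday, August 24th, 2012</abbr></div>
    </div>
</div>
<div class="evdate">
    <em>When:</em>
    <abbr class="dtstart" title="2012-07-03T21:00:00">July 3rd</abbr>
    to
    <abbr class="dtend" title="2012-07-13">July 12th</abbr>
</div>
HTML;

$starts_out = extractEventDate($file_string, 'From:', 'dtstart');
$ends_out   = extractEventDate($file_string, 'Until:', 'dtend');

var_dump($starts_out, $ends_out);
```

With this sample, `$starts_out` holds the start date from the detail block and `$ends_out` stays an empty string, because the related article's `dtend` is not preceded by `<div>Until:`. If the site ever changes the labels or the whitespace around them, the pattern will need adjusting.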

williamvicary

While I've shown you how to do this with a quick regex, I clearly advised you against using a regex for this sort of thing. As you can see for yourself, it can get out of hand rather quickly.

As pointed out by others (here and there), you should be using an HTML parser for this.


I'd advise you to use Simple HTML DOM, since it's very easy to work with, and their documentation is pretty good too.
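
To give an idea of what the parser approach looks like, here is a minimal sketch. Note that it uses PHP's built-in `DOMDocument` and `DOMXPath` rather than Simple HTML DOM, only so that it runs without installing anything; the idea is the same with either. The function name `eventDates` and the sample markup are assumptions for illustration, and the query relies on the `id="eventDetailInfo"` wrapper shown in the question.

```php
<?php
// Hypothetical helper: looks for the dtstart/dtend <abbr> elements only
// inside <div id="eventDetailInfo">, and leaves a value blank if absent.
function eventDates(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $dates = ['start' => '', 'end' => ''];

    foreach (['start' => 'dtstart', 'end' => 'dtend'] as $key => $class) {
        $nodes = $xpath->query(
            '//div[@id="eventDetailInfo"]//abbr[@class="' . $class . '"]'
        );
        if ($nodes !== false && $nodes->length > 0) {
            $dates[$key] = $nodes->item(0)->getAttribute('title');
        }
    }

    return $dates;
}

// Sample page: no "Until" row in the detail block, related article after it.
$sample = <<<'HTML'
<div id="eventDetailInfo">
    <h2>When</h2>
    <div class="p">
        <div>From:
            <abbr class="dtstart" title="2012-08-24T17:00:00">Friday, August 24th, 2012</abbr></div>
    </div>
</div>
<div class="evdate">
    <em>When:</em>
    <abbr class="dtstart" title="2012-07-03T21:00:00">July 3rd</abbr>
    to
    <abbr class="dtend" title="2012-07-13">July 12th</abbr>
</div>
HTML;

$dates = eventDates($sample);
var_dump($dates);
```

Because the query is scoped to the `eventDetailInfo` div, the related article's `dtend` is never considered, so `$dates['end']` stays an empty string when there is no Until row.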

Joseph Silber