I can't seem to figure out the regular expression I need in order to parse the following.

<div id="MustBeInThisId">
   <div class="ValueFromThisClass">
      The Value I need
   </div>
</div>

As you can see, I have a wrapping div with an id. That div contains multiple other divs, but I only need the value from one of them.

Alan Moore
Paramount
  • [You shouldn't try to parse HTML with RegEx](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – Lily Ballard Jul 24 '11 at 05:55
  • Oh, what should I use? I'm using PHP by the way. Edit: noticed the link... thanks. Edit 2: That's nice and all, but it doesn't stop me from needing the ability. – Paramount Jul 24 '11 at 05:57

4 Answers

If you are trying to extract some data from an HTML document, you should not use regular expressions.

Instead, you should use a DOM parser: those are made exactly for that.


In PHP, you would use the DOMDocument class, and its DOMDocument::loadHTML() method, to load the HTML content.


Then, you can work with methods such as DOMDocument::getElementById() and DOMDocument::getElementsByTagName().

You can even work with DOMXPath to execute XPath queries on your HTML content, which will allow you to search for pretty much anything in it.


In your case, I suppose that something like this should do the trick.

First, get your HTML content into a string (or use DOMDocument::loadHTMLFile()):

$html = <<<HTML
<p>hello</p>
<div>
    <div id="MustBeInThisId">
    <div class="ValueFromThisClass">
        The Value I need
    </div>
    </div>
</div>
HTML;

Then, load it into a DOMDocument instance:

$dom = new DOMDocument();
$dom->loadHTML($html);

Instantiate a DOMXPath object, and use it to query your DOM object:
My XPath expression might be a bit more complex than necessary... I'm not really good with those...

$xpath = new DOMXPath($dom);
$items = $xpath->query('//div[@id="MustBeInThisId"]/div[@class="ValueFromThisClass"]');

And, finally, work with the results of that query:

if ($items->length > 0) {
    var_dump( trim( $items->item(0)->nodeValue ) );
}

And here is your result:

string 'The Value I need' (length=16)
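
The query above matches only when the target div is a direct child of the wrapper and its class attribute is exactly `ValueFromThisClass`. If the real page nests the div deeper or gives it several classes, a slightly more tolerant expression may help. The following is only a sketch: the extra `Wrapper` div and the `highlighted` class are made up for illustration and are not from the question.

```php
<?php
// Hypothetical markup: the target div is nested one level deeper
// and carries a second class name.
$html = <<<HTML
<div id="MustBeInThisId">
    <div class="Other">Not this one</div>
    <div class="Wrapper">
        <div class="ValueFromThisClass highlighted">
            The Value I need
        </div>
    </div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Use // (any descendant) instead of / (direct child) under the wrapper,
// and match the class as a whole token in a space-separated class list.
$query = '//div[@id="MustBeInThisId"]'
       . '//div[contains(concat(" ", normalize-space(@class), " "), " ValueFromThisClass ")]';
$items = $xpath->query($query);

$value = $items->length > 0 ? trim($items->item(0)->nodeValue) : null;
var_dump($value);
```

The `concat(" ", normalize-space(@class), " ")` idiom is a common XPath 1.0 way to test for one class among several without also matching names that merely contain the same substring.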
Pascal MARTIN
  • Great solution, but unfortunately the HTML I'm loading isn't completely valid, so `loadHTML` is throwing errors. Edit: I take that back, they are warnings, not errors, so I can suppress them. – Paramount Jul 24 '11 at 06:31
  • You should be able to make those disappear using the **`@` operator** *(it's generally not a good idea to use it, though)*, see http://www.php.net/manual/en/language.operators.errorcontrol.php ; or with the **`libxml` functions**, see http://www.php.net/manual/en/function.libxml-use-internal-errors.php for example. – Pascal MARTIN Jul 24 '11 at 06:33
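
For the `libxml` route mentioned in the comment above, a minimal sketch could look like this. The sloppy markup is an assumption made up for illustration, and which messages the parser reports (if any) depends on the installed libxml2 version.

```php
<?php
// Deliberately sloppy markup (made up for illustration):
// an unknown tag and unclosed elements.
$html = '<div id="MustBeInThisId"><foo>'
      . '<div class="ValueFromThisClass">The Value I need</div>';

// Collect libxml errors internally instead of emitting PHP warnings.
// The call returns the previous setting so it can be restored later.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Fetch whatever the parser complained about, then empty the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore the previous error-handling mode.
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry is a LibXMLError object with level, code, line and message.
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
```

Unlike the `@` operator, this keeps the parser messages available for logging and does not hide unrelated errors raised by the same statement.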

Use something like simplehtmldom - it will make your life much, much easier.

$html = str_get_html($source_code);
$tag = $html->find("#MustBeInThisId .ValueFromThisClass", 0);
$the_value_i_need = $tag->innertext;
Amber
  • That was what I was using until I ran into a situation where I needed to pass cookies and a user agent. – Paramount Jul 24 '11 at 06:07
  • You were possibly using `file_get_html()` - notice that the code example I put there is using `str_get_html()`. You can pass it any string, from whatever you use to acquire the source code - for instance, you could use the cURL bindings to pass cookies/UA, and then take the resulting source code and pass it to `str_get_html()`. – Amber Jul 24 '11 at 06:10
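
The cURL step described in the comment above might be sketched as follows. It assumes the PHP cURL extension is installed. `fetchHtml()` is a hypothetical helper name, and the URL, user agent, and cookie values in the usage note are placeholders.

```php
<?php
// Hypothetical helper: download a page while sending a custom
// user agent and a Cookie header. Returns null if the request fails.
function fetchHtml(string $url, string $userAgent, string $cookies): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow HTTP redirects
        CURLOPT_USERAGENT      => $userAgent, // value of the User-Agent header
        CURLOPT_COOKIE         => $cookies,   // e.g. "name1=value1; name2=value2"
    ]);
    $body = curl_exec($ch);

    return $body === false ? null : $body;
}

// Usage (all values are placeholders):
// $source_code = fetchHtml('http://example.com/page', 'MyBot/1.0', 'session=abc123');
// The resulting string can then be handed to str_get_html($source_code)
// or to DOMDocument::loadHTML($source_code).
```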

Regex can't parse HTML since HTML isn't a regular language. You should use DOMDocument.

Then you get nice functions like getElementById :)
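
As a rough sketch of that approach without XPath, assuming markup shaped like the question's (the extra `Other` div is added here only for illustration):

```php
<?php
// Sample markup modelled on the question; the "Other" div is an assumption.
$html = '<div id="MustBeInThisId">'
      . '<div class="Other">Ignore me</div>'
      . '<div class="ValueFromThisClass"> The Value I need </div>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$value = null;

// Find the wrapping div by its id (null if no such element exists).
$wrapper = $dom->getElementById('MustBeInThisId');
if ($wrapper !== null) {
    // Walk every div below the wrapper and pick the one with the wanted class.
    foreach ($wrapper->getElementsByTagName('div') as $div) {
        if ($div->getAttribute('class') === 'ValueFromThisClass') {
            $value = trim($div->nodeValue);
            break;
        }
    }
}

var_dump($value);
```

The class comparison here is an exact string match, so a div with several classes would need a different check.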

Paul

Or try a JavaScript library like jQuery. I think it's the easiest way to do what you want.

Plop