3

I have opened an HTML file using

file_get_contents('http://www.example.com/file.html')

and want to parse the line including "ParseThis":

 <h1 class=\"header\">ParseThis<\/h1>

As you can see, it's within an h1 tag (the first h1 tag from the file). How can I get the text "ParseThis"?

Ryan Kohn
John Paneth

3 Answers

5

You can use DOM for this.

// Load remote file, suppress parse errors
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.example.com/file.html');
libxml_clear_errors();

// use XPath to find all h1 nodes with a class attribute of "header"
$xp = new DOMXpath($dom);
$nodes = $xp->query('//h1[@class="header"]');

// output first item's content
echo $nodes->item(0)->nodeValue;
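
For anyone who wants to try the XPath approach without hitting a remote URL, here is a minimal sketch. It assumes an inline sample string (`$html` below is made up for illustration, it is not the real file) and adds a guard for the case where no matching h1 exists, since `item(0)` returns null then.

```php
<?php
// Hypothetical sample markup standing in for the remote file.
$html = '<html><body><h1 class="header">ParseThis</h1><h1>Other</h1></body></html>';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

// Select every h1 whose class attribute is exactly "header".
$xp = new DOMXPath($dom);
$nodes = $xp->query('//h1[@class="header"]');

// item(0) is null when nothing matched, so check before reading nodeValue.
$first = $nodes->item(0);
$text = $first !== null ? trim($first->nodeValue) : null;

echo $text === null ? "no match\n" : $text . "\n";
```

Note that `@class="header"` only matches when the attribute is exactly `header`; an element with several classes would need a different expression.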

Marking this CW because I have answered this before, but I am too lazy to find the duplicate.

Community
Gordon
4

Use this function.

<?php
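// Return the text between the first $start and the next $end,
// or "" if either delimiter is missing.
function get_string_between($string, $start, $end)
{
    $ini = strpos($string, $start);
    if ($ini === false) {
        return "";
    }
    $ini += strlen($start);
    $endPos = strpos($string, $end, $ini);
    if ($endPos === false) {
        return "";
    }
    return substr($string, $ini, $endPos - $ini);
}

$data = file_get_contents('http://www.example.com/file.html');

// Single-quoted strings keep the backslashes literally,
// matching the escaped markup shown in the question.
echo get_string_between($data, '<h1 class=\"header\">', '<\/h1>');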
shamittomar
  • It may work for this case, but you should be using DOM selectors or XML navigation. – Incognito Aug 28 '10 at 17:21
  • I prefer this because it works faster than DOM, and when the requirements are very simple like this, I use my `get_string_between` :) – shamittomar Aug 28 '10 at 17:27
1

Since it is the first h1 tag, getting it should be fairly trivial:

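// $html holds the markup fetched in the question
$html = file_get_contents('http://www.example.com/file.html');
$doc = new DOMDocument();
$doc->loadHTML($html);
// getElementsByTagName returns the h1 elements in document order
$h1 = $doc->getElementsByTagName('h1');
echo $h1->item(0)->nodeValue;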

http://php.net/manual/en/class.domdocument.php
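
One caveat: the line quoted in the question contains backslashes (`\"` and `\/`). If the fetched data really looks like that, which is an assumption on my part, the DOM parser may not see a clean h1 element. A rough sketch of one way to handle it is to unescape with `stripslashes()` first. The `$data` string below is a hypothetical sample copied from the question, not the real file.

```php
<?php
// Hypothetical sample: the escaped line from the question.
// Single quotes keep the backslashes literally.
$data = '<h1 class=\"header\">ParseThis<\/h1>';

// stripslashes() drops the backslashes, turning \" into " and \/ into /.
$html = stripslashes($data);

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Take the first h1, guarding against a document with none.
$h1 = $doc->getElementsByTagName('h1')->item(0);
$text = $h1 !== null ? $h1->nodeValue : '';

echo $text, "\n";
```

Keep in mind that `stripslashes()` removes every escaping backslash in the input, so this only makes sense when the whole payload is escaped in the same way.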

karim79