I have an HTML file (sample.html) like this:

<html>
<head>
</head>
<body>
<div id="content">
<!--content-->

<p>some content</p>

<!--content-->
</div>
</body>
</html>

How do I get the content between the two HTML comments '<!--content-->' using PHP? I want to extract it, do some processing, and put it back, so I need both to read it and to write it. Is that possible?

Gordon
esafwan
  • By "content", do you mean `some content` or `<p>some content</p>`, and will the comment nodes always be written `<!--content-->`?
    – Gordon Aug 04 '10 at 10:05

5 Answers

esafwan - you could use a regular expression to extract the content inside the div with a given id.

I've done this for image tags before, and the same rules apply. I'll dig out the code and update this answer in a bit.

[update] try this:

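<?php
    // Return the inner HTML of the first <div> whose $attr attribute equals
    // $value, or null if no such div is found.
    function get_tag( $attr, $value, $xml ) {

        // Escape regex metacharacters, including the "/" delimiter used below.
        $attr = preg_quote($attr, '/');
        $value = preg_quote($value, '/');

        // s: "." also matches newlines; i: case-insensitive.
        // The lazy (.*?) stops at the first </div>, so nested divs are cut short.
        $tag_regex = '/<div[^>]*'.$attr.'="'.$value.'"[^>]*>(.*?)<\\/div>/si';

        if (preg_match($tag_regex, $xml, $matches)) {
            return $matches[1];
        }
        return null;
    }

    $yourentirehtml = file_get_contents("test.html");
    $extract = get_tag('id', 'content', $yourentirehtml);
    echo $extract;
?>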

or more simply:

preg_match("/<div[^>]*id=\"content\">(.*?)<\\/div>/si", $text, $match);
$content = $match[1]; 

jim

jim tollan
  • So where is the `id` attribute in `<!--content-->`? – Gordon Aug 04 '10 at 10:24
  • gordon - the section that's pulled out is the content contained within the div whose id is "content", along the same lines as the jQuery $('#content').html() function – jim tollan Aug 04 '10 at 10:32
  • But how do I actually load the HTML into $yourentirehtml? – esafwan Aug 04 '10 at 10:36
  • esafwan - if you're inside your PHP program and the file is on the same domain, use file_get_contents: http://www.w3schools.com/PHP/func_filesystem_file_get_contents.asp. If you're using JavaScript (or jQuery), the whole ball game changes, but I'm assuming you aren't, as you haven't tagged the question as such. – jim tollan Aug 04 '10 at 10:39
  • @jim The OP asked for the content between the comment nodes. While your solution is pragmatically feasible, it is technically not what was asked. – Gordon Aug 04 '10 at 10:56
  • Yes, sometimes the impedance between the question and the answer CAN balance out - :-) thanks... – jim tollan Aug 04 '10 at 11:35

If this is a simple replacement that does not involve parsing the actual HTML document, you may use a regular expression or even just str_replace. In general, though, it is not advisable to use regex for HTML, because HTML is not a regular language and coming up with reliable patterns can quickly become a nightmare.
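For the simple case, here is a minimal sketch of the get-and-put round trip with preg_replace_callback. It assumes the marker appears exactly as '<!--content-->' and exactly twice; replace_between_markers is a hypothetical helper name, and the str_replace callback only stands in for whatever processing you actually need.

```php
<?php
// Hypothetical helper: runs $callback on the text between the first pair of
// $marker comments and returns the document with the result put back in place.
function replace_between_markers(string $html, callable $callback, string $marker = '<!--content-->'): string
{
    // Escape the marker so it is matched literally ("/" is the delimiter).
    $quoted = preg_quote($marker, '/');

    // s: "." also matches newlines. The lazy (.*?) stops at the second marker.
    $pattern = '/(' . $quoted . ')(.*?)(' . $quoted . ')/s';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($callback): string {
            // Keep both markers and swap only the text between them.
            return $m[1] . $callback($m[2]) . $m[3];
        },
        $html,
        1 // only the first pair of markers
    );

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

$html = "<div id=\"content\">\n<!--content-->\n<p>some content</p>\n<!--content-->\n</div>";

// Stand-in for real processing: swap one word.
$processed = replace_between_markers($html, function (string $inner): string {
    return str_replace('some', 'processed', $inner);
});

echo $processed, "\n";
```

To work on the file from the question, read it with file_get_contents('sample.html') and write the result back with file_put_contents.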

The right way to parse HTML in PHP is to use a parsing library that actually knows how to make sense of HTML documents. Your best native bet would be DOM, but PHP has a number of other native XML extensions you can use, and there are also a number of third-party libraries such as phpQuery, Zend_Dom, QueryPath and FluentDom.
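As a rough illustration of the DOM route, the sketch below finds the two comment nodes with XPath, serializes the nodes between them, passes that string to a callback, and inserts the result back. It assumes both comments are siblings inside the same parent and that the processed markup is well-formed XML (DOMDocumentFragment::appendXML rejects anything else, including HTML-only entities such as &nbsp;). process_between_comments is a hypothetical helper name.

```php
<?php
// Hypothetical helper: returns the whole document with the nodes between the
// first two <!--content--> comments replaced by $callback's output, or null
// if the markers are missing, not siblings, or the new markup is malformed.
function process_between_comments(string $html, callable $callback): ?string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings quietly
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $markers = $xpath->query("//comment()[normalize-space(.)='content']");
    if ($markers === false || $markers->length < 2) {
        return null;
    }
    $start = $markers->item(0);
    $end = $markers->item(1);
    if (!$start->parentNode->isSameNode($end->parentNode)) {
        return null;
    }

    // Collect the sibling nodes between the markers and serialize them.
    $between = [];
    for ($node = $start->nextSibling; $node !== null && !$node->isSameNode($end); $node = $node->nextSibling) {
        $between[] = $node;
    }
    $inner = '';
    foreach ($between as $node) {
        $inner .= $dom->saveHTML($node);
    }

    // Build the replacement first, so the document is untouched on failure.
    $fragment = $dom->createDocumentFragment();
    if (!$fragment->appendXML($callback($inner))) {
        return null;
    }
    foreach ($between as $node) {
        $node->parentNode->removeChild($node);
    }
    $end->parentNode->insertBefore($fragment, $end);

    return $dom->saveHTML();
}

$html = "<html><head></head><body><div id=\"content\"><!--content-->\n\n"
      . "<p>some content</p>\n\n<!--content--></div></body></html>";

$output = process_between_comments($html, function (string $inner): string {
    return str_replace('some', 'processed', $inner);
});

echo $output;
```

Note that saveHTML() serializes the whole document, so the output may differ from the input in details such as an added DOCTYPE.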

If you use the search function, you will see that this topic has been covered extensively and you should have no problems finding examples that show how to solve your question.

Community
Gordon
  • +1, and if you're looking for a proper XPath to match the nodes, it's `(//*|//text())[preceding-sibling::comment()='content' and following-sibling::comment()='content']` – Wrikken Aug 04 '10 at 10:50
  • Thanks... All the links helped me a lot, though they didn't answer my question directly. The links are worth reading and helped me gain more depth in PHP! – esafwan Aug 04 '10 at 11:32
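<?php
    // Split the file on the marker. With two markers this yields three parts:
    // before the first marker, between the markers, and after the second.
    $content = file_get_contents("sample.html");
    $parts = explode("<!--content-->", $content);
    if (count($parts) >= 3) {
        // $parts[1] is the text between the two markers.
        // strip_tags() drops the markup; remove that call to keep the HTML.
        var_dump(strip_tags($parts[1]));
    }
?>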

Check this; it should work for you.

Ankur Mukherjee

The problem is with nested divs. I found a solution here:

<?php // File: MatchAllDivMain.php
// Read html file to be processed into $data variable
$data = file_get_contents('test.html');
// Commented regex to extract contents from <div class="main">contents</div>
//  where "contents" may contain nested <div>s.
//  Regex uses PCRE's recursive (?1) sub expression syntax to recurs group 1
$pattern_long = '{           # recursive regex to capture contents of "main" DIV
<div\s+class="main"\s*>              # match the "main" class DIV opening tag
  (                                   # capture "main" DIV contents into $1
    (?:                               # non-cap group for nesting * quantifier
      (?: (?!<div[^>]*>|</div>). )++  # possessively match all non-DIV tag chars
    |                                 # or 
      <div[^>]*>(?1)</div>            # recursively match nested <div>xyz</div>
    )*                                # loop however deep as necessary
  )                                   # end group 1 capture
</div>                               # match the "main" class DIV closing tag
}six';  // single-line (dot matches all), ignore case and free spacing modes ON

// short version of same regex
$pattern_short = '{<div\s+class="main"\s*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>}si';

$matchcount = preg_match_all($pattern_long, $data, $matches);
// $matchcount = preg_match_all($pattern_short, $data, $matches);
echo("<pre>\n");
if ($matchcount > 0) {
    echo("$matchcount matches found.\n");
//  print_r($matches);
    for($i = 0; $i < $matchcount; $i++) {
        echo("\nMatch #" . ($i + 1) . ":\n");
        echo($matches[1][$i]); // print 1st capture group for match number i
    }
} else {
    echo('No matches');
}
echo("\n</pre>");
?>
piernik

Have a look here for a code example that shows how to load an HTML document into SimpleXML: http://blog.charlvn.com/2009/03/html-in-php-simplexml.html

You can then treat it as a normal SimpleXML object.

EDIT: This will only work if you want the content in a tag (e.g. between <div> and </div>)
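One way this can look, as a sketch rather than the linked post's exact code: parse the HTML with DOMDocument, hand the result to simplexml_import_dom, and query it with XPath. The sample markup is inlined from the question so the snippet is self-contained.

```php
<?php
// Sample markup from the question, inlined for the sketch.
$html = <<<'HTML'
<html>
<head>
</head>
<body>
<div id="content">
<!--content-->

<p>some content</p>

<!--content-->
</div>
</body>
</html>
HTML;

// DOMDocument's HTML parser tolerates markup that is not valid XML.
$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings quietly
$dom->loadHTML($html);
libxml_clear_errors();

// Hand the parsed tree over to SimpleXML.
$xml = simplexml_import_dom($dom);

// Look the div up by id, then read its <p> child as a string.
$paragraph = null;
$divs = $xml ? $xml->xpath('//div[@id="content"]') : [];
if ($divs) {
    $paragraph = (string) $divs[0]->p;
    echo $paragraph, "\n";
}
```

As the EDIT above says, this addresses elements, not the comment markers themselves.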

NullUserException
Jake