PHP Getting text between HTML nodes

Question

The title says it all. How can I get the text between HTML nodes using PHP? Any ideas? Below is my HTML structure.

<html>
<head>
    <title>Test Page</title>
</head>
<body>
    <div id="outer">
        <div id="first">
            <p class="this">Hello</p>
            <p class="this">Community</p>
        </div>
        <div id="second">
            <p class="that">Stack</p>
            <p class="that">Overflow</p>
        </div>
    </div>
</body>
</html>

Expected output:

HelloStackOverflowCommunity

Use an HTML parser library, like `DOMDocument` or `PHP Simple HTML DOM`. — Barmar, Oct 23 '14 at 08:07
Is there anything you have tried? If so could you post the code. — Jan-Willem de Boer, Oct 23 '14 at 08:09
http://stackoverflow.com/questions/17054815/php-domdocument-read-element-inner-text and you might also want to look at strip_tags() — GordonM, Oct 23 '14 at 08:13
just play with this: http://php.net/manual/en/class.domdocument.php — , Oct 23 '14 at 08:33

score 1 · Answer 1 · answered Oct 23 '14 at 08:33

That's quite easy. Get the PHP Simple HTML DOM Parser here: http://sourceforge.net/projects/simplehtmldom/files/

Then use the following code:

/* include simpledom*/
include('simple_html_dom.php');

/* load html string */
$html_string = <<<HTML
<html>
<head>
    <title>Test Page</title>
</head>
<body>
    <div id="outer">
        <div id="first">
            <p class="this">Hello</p>
            <p class="this">Community</p>
        </div>
        <div id="second">
            <p class="that">Stack</p>
            <p class="that">Overflow</p>
        </div>
    </div>
</body>
</html>
HTML;

/* create simple dom object from html */
$html = str_get_html($html_string);

/* find all paragraph elements */
$paragraph = $html->find('div[id=outer] div p');

/* loop through all elements and get inner text */
foreach($paragraph as $p){
    echo $p->innertext;
}

Cheers,

Roy

KIKO Software · Answer 2 · 2014-10-23T14:14:02.997

You could try:

$text = strip_tags($html);

http://www.php.net/manual/en/function.strip-tags.php

That will get you quite far. It leaves spaces and line breaks behind, but those are easy to remove.

$clean = str_replace(array(' ',"\n","\r"),'',$text);

http://www.php.net/manual/en/function.str-replace.php

Using it on your example gives:

TestPageHelloCommunityStackOverflow

If you want to keep a single space between the words, you could instead collapse every run of whitespace into one space:

$clean = trim(preg_replace('/\s+/', ' ', $text));

which results in:

Test Page Hello Community Stack Overflow

Many variations are possible.
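
One such variation, as a rough sketch rather than the definitive way: split the stripped text on line breaks and keep only the non-empty pieces, so each text fragment ends up as its own array entry. The variable names ($lines, $chunks) are made up for illustration; the markup is the sample from the question.

```php
<?php
// Sample markup copied from the question.
$html = <<<HTML
<html>
<head>
    <title>Test Page</title>
</head>
<body>
    <div id="outer">
        <div id="first">
            <p class="this">Hello</p>
            <p class="this">Community</p>
        </div>
        <div id="second">
            <p class="that">Stack</p>
            <p class="that">Overflow</p>
        </div>
    </div>
</body>
</html>
HTML;

// Strip the tags, then split what is left on line breaks.
$lines = preg_split('/\R/', strip_tags($html));

// Trim every line and drop the ones that are empty afterwards.
$chunks = array_values(array_filter(array_map('trim', $lines), 'strlen'));

// $chunks now holds: Test Page, Hello, Community, Stack, Overflow
echo implode('', $chunks);
```

This keeps "Test Page" together as one entry, so the title is easy to drop with array_slice($chunks, 1) if it is not wanted.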

score 0 · Answer 3 · answered Oct 23 '14 at 08:14

Parsing HTML with regular expressions is strongly discouraged.
Use the Simple HTML DOM library instead: http://sourceforge.net/projects/simplehtmldom/files/simplehtmldom/
Include it: include 'simple_html_dom.php';
Load your HTML string: $html = str_get_html($html_string);
Get the tags you need: $tags = $html->find('p');
Create an array of their text: $a = array(); foreach ($tags as $tag) $a[] = $tag->innertext;
Create your string in the order you want: $string = $a[0] . $a[2] . $a[3] . $a[1];
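
For anyone who wants to try these steps without downloading anything, here is a minimal sketch of the same idea with PHP's built-in DOMDocument swapped in for Simple HTML DOM. The variable names ($markup, $texts) are only illustrative, and the markup is the sample from the question.

```php
<?php
// Sample markup copied from the question.
$markup = <<<HTML
<html>
<head>
    <title>Test Page</title>
</head>
<body>
    <div id="outer">
        <div id="first">
            <p class="this">Hello</p>
            <p class="this">Community</p>
        </div>
        <div id="second">
            <p class="that">Stack</p>
            <p class="that">Overflow</p>
        </div>
    </div>
</body>
</html>
HTML;

// Parse the markup with the built-in DOM extension.
$dom = new DOMDocument();
$dom->loadHTML($markup);

// Collect the text of every <p> element in document order.
$texts = array();
foreach ($dom->getElementsByTagName('p') as $p) {
    $texts[] = $p->textContent;
}

// Concatenate in the order the question asks for.
$string = $texts[0] . $texts[2] . $texts[3] . $texts[1];
echo $string; // HelloStackOverflowCommunity
```

The hard-coded indexes only work for this exact markup; a different page would need its own ordering rule.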

score 0 · Accepted Answer · answered Oct 23 '14 at 08:23

I would recommend using PHP's built-in DOMDocument rather than a third-party class like simplehtmldom.

Third-party parsers like that are really slow on big HTML files (I have worked with them).

<?php
$html ='
<html>
<head>
    <title>Test Page</title>
</head>
<body>
    <div id="outer">
        <div id="first">
            <p class="this">Hello</p>
            <p class="this">Community</p>
        </div>
        <div id="second">
            <p class="that">Stack</p>
            <p class="that">Overflow</p>
        </div>
    </div>
</body>
</html>
';

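// a new DOM object
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;

// load the HTML into the object
$dom->loadHTML($html);
// get the body tag
$body = $dom->getElementsByTagName('body')->item(0);
// loop through all tags inside the body
foreach ($body->getElementsByTagName('*') as $element) {
    $first = $element->firstChild;
    // skip empty elements and elements whose first child is another tag
    if ($first !== null && $first->nodeType === XML_TEXT_NODE) {
        // print the text without the surrounding whitespace
        print trim($first->textContent);
    }
}

The output would be HelloCommunityStackOverflow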
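
If you only want the paragraphs inside the outer div, DOMXPath (also built in) can select them directly. This is a small sketch, not the only way to do it; the XPath expression and the variable names ($markup, $out) are my own choices for the sample markup from the question.

```php
<?php
// Sample markup copied from the question.
$markup = <<<HTML
<html>
<head>
    <title>Test Page</title>
</head>
<body>
    <div id="outer">
        <div id="first">
            <p class="this">Hello</p>
            <p class="this">Community</p>
        </div>
        <div id="second">
            <p class="that">Stack</p>
            <p class="that">Overflow</p>
        </div>
    </div>
</body>
</html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($markup);

// Select every <p> that sits in a <div> directly under div#outer.
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@id="outer"]/div/p');

// Concatenate the text of the selected paragraphs.
$out = '';
foreach ($nodes as $p) {
    $out .= trim($p->textContent);
}

echo $out; // HelloCommunityStackOverflow
```

Because the query names the elements explicitly, the title and any whitespace-only text nodes never enter the result.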

score -1 · Answer 5 · answered Oct 23 '14 at 08:08

Try this one

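function getTextBetweenTags($string, $tagname)
{
    $tag = preg_quote($tagname, '/');
    // allow attributes on the opening tag; the "s" flag lets "." span line breaks
    $pattern = "/<{$tag}\b[^>]*>(.*?)<\/{$tag}>/s";
    // preg_match_all collects every match, not just the first one
    preg_match_all($pattern, $string, $matches);
    return $matches[1];
}

The function returns an array with the text of every matching tag, so you have to loop through it...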


envoy


@jurgemaister damn, beat me to it! – GordonM Oct 23 '14 at 08:14
