16

I want to receive an array that contains all the h1 tag values from a text.

For example, if this were the given input string:

<h1>hello</h1>
<p>random text</p>
<h1>title number two!</h1>

I need to receive an array containing this:

titles[0] = 'hello',
titles[1] = 'title number two!'

I already figured out how to get the first h1 value of the string but I need all the values of all the h1 tags in the given string.

I'm currently using this to receive the first tag:

function getTextBetweenTags($string, $tagname) 
 {
  $pattern = "/<$tagname ?.*>(.*)<\/$tagname>/";
  preg_match($pattern, $string, $matches);
  return $matches[1];
 }

I pass it the string I want to be parsed and as $tagname I put in "h1". I didn't write it myself though, I've been trying to edit the code to do what I want it to but nothing really works.

I was hoping someone could help me out.

Thanks in advance.

Pieter888
  • 1
    At minimum you should use preg_match_all. Take a look at http://simplehtmldom.sourceforge.net/ – fabrik Jul 21 '10 at 12:15

4 Answers

36

You could use simplehtmldom:

function getTextBetweenTags($string, $tagname) {
    // Create DOM from string (str_get_html() comes from simple_html_dom.php)
    $html = str_get_html($string);

    $titles = array();
    // Find all matching tags and collect their text
    foreach($html->find($tagname) as $element) {
        $titles[] = $element->plaintext;
    }
    return $titles;
}
Sergey Eremin
  • Oooh I didn't know you could do that! – Rimian Jul 21 '10 at 12:26
  • 2
    Is simplehtmldom any faster than DOMDocument or just for those occasions where DOMDocument doesn't exist (although it's enabled by default)? – Wrikken Jul 21 '10 at 12:39
  • 1
    @Wrikken it is userland code, so I doubt it is faster. Dunno why people are so fascinated with it (must be the *simple* in the name), especially because there is also [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.query.html), [phpquery](http://code.google.com/p/phpquery/) or [FluentDom](http://nightly.fluentdom.org/documentation) for alternatives. – Gordon Jul 21 '10 at 12:49
  • @Wrikken it isn't faster(almost the same), but handles invalid html better. also much less problems with non-utf encodings... – Sergey Eremin Jul 21 '10 at 12:49
  • @kgb you [claimed this before](http://stackoverflow.com/questions/3220076/php-domdocument-how-can-i-print-an-attribute-of-an-element/3220084#3220084) but I still refute it. DOM handles HTML fine. – Gordon Jul 21 '10 at 12:52
  • i've had a better example, but now i could find only this: http://stackoverflow.com/questions/1183482/cant-separate-cells-properly-with-simplehtmldom. the point is - simplehtmldom works if the html is not a valid tree.. and i don't like putting "@" to suppress warnings ;) – Sergey Eremin Jul 21 '10 at 13:01
  • 2
    @kgb DOM can load invalid HTML fine if you load it with loadHTML. The only thing not working then is getElementById and that is solely due to the fallback to the HTML4.0 DTD. You can still very much query nodes by ID via XPath then. Also, you do not have to suppress the errors with @ at all. You can use libxml_use_internal_errors and handle any errors by custom error handlers. SimpleHTMLDom isn't more suitable for HTML. It doesn't even use libxml but parses the HTML with String functions. – Gordon Jul 21 '10 at 13:22
  • I'd take 'can report errors but is configurable to keep quiet' above 'will not tell you when something is up' any time of the week :) – Wrikken Jul 21 '10 at 15:52
  • 2
    -1 for not using the built in c extension to do the exact same thing (Seriously, why do things in PHP if the exact same thing is built into the PHP core?)... Use `DomDocument` instead... – ircmaxell Jul 21 '10 at 15:56
  • > we use DOMDocument and modern php classes – jkoop Feb 03 '23 at 15:55
25
function getTextBetweenTags($string, $tagname){
    $d = new DOMDocument();
    $d->loadHTML($string);
    $return = array();
    foreach($d->getElementsByTagName($tagname) as $item){
        $return[] = $item->textContent;
    }
    return $return;
}
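
A variation on the function above, following the comment under the previous answer that libxml_use_internal_errors can be used instead of "@": this is only a sketch, and the name getTextBetweenTagsQuiet is hypothetical. It assumes the parser warnings for sloppy markup can be discarded rather than inspected.

```php
<?php
// Hypothetical helper: same idea as getTextBetweenTags(), but libxml
// parse warnings are collected internally instead of being emitted.
function getTextBetweenTagsQuiet(string $html, string $tagname): array
{
    // Remember the previous setting so it can be restored afterwards.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Assumption: the collected errors are not needed, so drop them.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($doc->getElementsByTagName($tagname) as $node) {
        $texts[] = $node->textContent;
    }
    return $texts;
}
```

Calling getTextBetweenTagsQuiet($string, 'h1') on the input from the question should give an array with 'hello' and 'title number two!'.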
Wrikken
10

Alternative to DOM. Use when memory is an issue.

$html = <<< HTML
<html>
<h1>hello<span>world</span></h1>
<p>random text</p>
<h1>title number two!</h1>
</html>
HTML;

$reader = new XMLReader;
$reader->xml($html);
while($reader->read() !== FALSE) {
    if($reader->name === 'h1' && $reader->nodeType === XMLReader::ELEMENT) {
        echo $reader->readString();
    }
}
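
Since the question asks for an array rather than echoed output, the loop above could be wrapped roughly like this. This is a sketch only: the function name collectTagText is hypothetical, and it assumes the input is well-formed XML with a single root element, which XMLReader requires.

```php
<?php
// Hypothetical helper; the name collectTagText is not from the answer above.
function collectTagText(string $xml, string $tagname): array
{
    $reader = new XMLReader();
    // Input must be well-formed XML with exactly one root element.
    $reader->XML($xml);

    $texts = [];
    while ($reader->read()) {
        // Only opening tags; closing tags report the same name
        // but have the node type XMLReader::END_ELEMENT.
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $tagname) {
            $texts[] = $reader->readString();
        }
    }
    $reader->close();
    return $texts;
}
```

Usage would be along the lines of collectTagText($html, 'h1'), with $html wrapped in a root element as in the heredoc above.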
Gordon
6
 function getTextBetweenH1($string)
 {
    $pattern = "/<h1>(.*?)<\/h1>/";
    preg_match_all($pattern, $string, $matches);
    return ($matches[1]);
 }
Ahmed Aman
  • 12
    Using regex is quite fine here. He isn't parsing HTML. He is matching stuff between `<h1>` and `</h1>`, which is inherently regular. Matching a regular language with regular expressions is quite fine. Drop the mindless "OMG regex cannot be used for anything if there is HTML involved" crap that everybody seems to be hyping. It's not like he is trying to match all of HTML, only a very small subset of the language which happens to be regular. – Daniel Egeberg Jul 21 '10 at 13:17
  • 2
    @Daniel what if there are attributes on the `<h1>`? What if the headings contain element children? – Gordon Jul 21 '10 at 13:28
  • 2
    @Gordon: The attribute problem can be solved using this regex: `#<h1…])*>(.*?)</h1>#i` (which I believe still describes a regular language and thus can be represented using a finite state machine). The problem with child elements is non-existent because there cannot be an `<h1>` within another `<h1>` anyways. Edit: The regex is written for a single-quoted PHP string. – Daniel Egeberg Jul 21 '10 at 13:51
  • 2
    @Daniel you have to admit that this is completely unreadable :) Also, there can be inline elements in an h1. What about spans? strongs? ems? The h1 of this very page has a link inside. Regex has no concept of TextNodes. It just knows Strings. – Gordon Jul 21 '10 at 13:56
  • 2
    This regex will still work, even if there are inline elements inside the H1 element... IMHO, it doesn't matter if it is unreadable either, as it is very much a set and forget function. – evilunix Jan 06 '15 at 14:18
  • Ten years later and it's still completely readable, if you actually understand and know how to write regex. I like this solution and the regex is quite simple. Easy to do ('|") for double quote matches. – Mike Kormendy Sep 13 '20 at 16:30
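
To make the two objections raised in this thread concrete (attributes on the tag, inline elements inside it), here is one possible sketch. The function name getTextBetweenTagsRegex and the pattern are illustrative and are not the regex from the comments above. It assumes no attribute value contains a literal ">" and that tags of the same name are not nested.

```php
<?php
// Hypothetical helper: name and pattern are illustrative only.
function getTextBetweenTagsRegex(string $html, string $tagname): array
{
    $tag = preg_quote($tagname, '#');
    // \b keeps "h1" from matching "h10"; [^>]* skips any attributes;
    // the s flag lets content span lines, the i flag ignores tag case.
    $pattern = '#<' . $tag . '\b[^>]*>(.*?)</' . $tag . '\s*>#is';
    preg_match_all($pattern, $html, $matches);

    // strip_tags() removes inline children such as <span> or <a>,
    // leaving only their text.
    return array_map('strip_tags', $matches[1]);
}
```

For example, getTextBetweenTagsRegex('<h1 class="a">hello <span>world</span></h1>', 'h1') should return an array holding 'hello world'. For markup that breaks the assumptions above, the DOM-based answers are the safer route.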