How to retrieve an html element opening tag from an html string?

Question

I'm trying to retrieve `<h1>` opening tags from an HTML string. I want to include everything from `<h1` to `>`.

This is how I'm trying to do it right now. However, it seems to cause an encoding problem: when I print the resulting $html, UTF-8 characters are displayed incorrectly:

        $dom = new DOMDocument();
        $dom->loadHTML($html);

        // Select every h1 element inside the body
        $xpath = new DOMXPath($dom);
        $elements = $xpath->evaluate("/html/body//h1");

        foreach ($elements as $element) {
            print_r($element);
        }

        // save html
        $html = $dom->saveHTML();

How can I make sure it includes everything up to the closing `>`?
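
For reference, one possible way to get the opening tags out of the parsed document is to rebuild them from each element's attributes. This is only a minimal sketch, not the code from the question: the function name `h1OpeningTags()` is made up for illustration, and the `<?xml encoding="UTF-8">` prefix is a commonly used workaround that assumes `$html` is UTF-8 encoded. It produces a normalized tag (double-quoted, escaped attribute values), not the exact source text.

```php
<?php
// Hypothetical helper: returns one rebuilt opening tag per <h1> element.
function h1OpeningTags(string $html): array
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: $html is UTF-8. The XML declaration hints that to libxml.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $tags = [];
    foreach ($xpath->query('//h1') as $h1) {
        $tag = '<h1';
        // Re-serialize every attribute with double quotes and escaping.
        foreach ($h1->attributes as $attr) {
            $value = htmlspecialchars($attr->nodeValue, ENT_QUOTES, 'UTF-8');
            $tag .= ' ' . $attr->nodeName . '="' . $value . '"';
        }
        $tags[] = $tag . '>';
    }
    return $tags;
}
```

Calling `h1OpeningTags($html)` returns an array of strings, one per `<h1>` in the input, in document order.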

Don't use regex for this. http://blog.codinghorror.com/parsing-html-the-cthulhu-way/ – Quentin, Sep 18 '14 at 13:47
Hi Quentin, as you recommended, I used DOM. I'll upload the implementation in a moment. – lisovaccaro, Sep 18 '14 at 14:16
I just edited the question to show what I'm doing. However, I'm having a problem with encoding. I'm not sure how the evaluate method works, so that may be causing the problem. – lisovaccaro, Sep 18 '14 at 14:18
@DJDavid98 It won't work well with things like `…`. See what happens if you use regex to parse HTML: http://stackoverflow.com/a/1732454/1529630 – Oriol, Sep 18 '14 at 14:20
@Oriol Any `<` and `>` that's not part of the HTML markup should be escaped as `&lt;` and `&gt;`. If someone uses problematic code like that along with regex, that's their fault. – SeinopSys, Sep 18 '14 at 14:28
@DJDavid98 According to the HTML5 spec, [attribute values](http://www.w3.org/TR/html5/syntax.html#syntax-attribute-value) only have the restriction that the text cannot contain an ambiguous ampersand. Additionally, single-quoted attribute values can't contain `'`, and double-quoted attribute values can't contain `"`. Only [unquoted attribute values](http://www.w3.org/TR/html5/syntax.html#unquoted) have further restrictions; for example, they can't contain `>`. – Oriol, Sep 18 '14 at 14:51
@lisovaccaro The first version of the question asked for a regex. IMHO, if we change questions drastically, everybody will get confused. This post will be read not only by us but also by devs who are looking for similar solutions. If you changed your mind and now want to use a parser, do not edit the question with new doubts; you can start a new thread for that. – Marco Medrano, Sep 18 '14 at 16:32

Marco Medrano · Answer 1 · 2014-09-18T16:18:41.473

1

Not an expert, but I made this:

<h1( [^>]*(["'].*["'])\1*)?>

Here are my tests: [image]

Update 1:

<h1\s*>|(.*=['"]*[^'"]*['"]*)>

Update 2:

<h1(.+=((['"]+[^'"]*['"]+)|[0-9]+))*\s*>

I modeled what an h1 tag should support.
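
To make it easier to check a pattern against the kinds of inputs raised in the comments, here is a small PHP sketch. It does not use any of the patterns above: the constant `H1_OPEN_TAG` and the function `matchH1OpenTags()` are hypothetical names, and the pattern is just one possible variant that treats quoted attribute values as units, so a `>` inside quotes does not end the tag. It still knows nothing about comments, scripts, or CDATA, so it is no substitute for a real parser.

```php
<?php
// Hypothetical pattern: "<h1", then any mix of double-quoted strings,
// single-quoted strings, or characters that are neither quotes nor ">",
// then the closing ">". The lookahead rejects names such as <h10>.
const H1_OPEN_TAG = '/<h1(?=[\s>\/])(?:"[^"]*"|\'[^\']*\'|[^\'">])*>/i';

// Returns every substring of $html that the pattern accepts as an
// <h1> opening tag, in order of appearance.
function matchH1OpenTags(string $html): array
{
    preg_match_all(H1_OPEN_TAG, $html, $matches);
    return $matches[0];
}

$sample = '<h1>A</h1><h1 title="a > b" class=\'x\'>B</h1><h10>no</h10>';
print_r(matchH1OpenTags($sample));
```

Unlike the DOM route, this returns the tag text exactly as it appears in the source string, which is what the question asks for, at the cost of the usual regex blind spots.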

edited Sep 18 '14 at 16:18

answered Sep 18 '14 at 14:26


It doesn't match `…`, `…` nor `…`. – Oriol Sep 18 '14 at 14:34
@Oriol That's now been partially corrected, but still does not match `…` or the like. – SeinopSys Sep 18 '14 at 14:40
Update 1 matches `…`, but not until the end. – Oriol Sep 18 '14 at 15:40
@Oriol you have a battery of tests right? ;) please check Update 2. – Marco Medrano Sep 18 '14 at 16:19
