I'm trying to retrieve the opening <h1> tags from an HTML string. I want to capture everything from <h1 up to the closing >, attributes included.

This is how I'm currently trying to do it. However, it seems to cause encoding problems: when I print the resulting $html, UTF-8 characters display incorrectly:

        $dom = new DOMDocument();
        $dom->loadHTML($html);

        // Select every <h1> element inside the document body
        $xpath = new DOMXPath($dom);
        $elements = $xpath->evaluate("/html/body//h1");

        // Dump each matched node
        for ($i = 0; $i < $elements->length; $i++) {
            print_r($elements->item($i));
        }

        // Serialize the document back to an HTML string
        $html = $dom->saveHTML();

How can I make sure the result includes everything up to the closing >?
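One possible way to get there with the DOM approach above, sketched under a few assumptions: the input is UTF-8, and an opening tag rebuilt from the parsed attributes is acceptable in place of the verbatim source text. As far as I know, `loadHTML()` falls back to ISO-8859-1 when the markup does not declare a charset, which would explain the garbled characters; prefixing an XML encoding declaration is a common workaround. `h1OpeningTags` is a hypothetical helper name, not part of the original code.

```php
<?php
// Hypothetical helper (not from the original post): rebuilds each <h1 ...>
// opening tag from the parsed DOM instead of slicing it out of the raw string.
function h1OpeningTags(string $html): array
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: $html is UTF-8. The XML encoding prefix is a workaround
    // that hints the encoding to the parser.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $tags = [];
    foreach ($xpath->query('//h1') as $h1) {
        $tag = '<h1';
        // Re-emit every attribute in document order, double-quoted and escaped.
        foreach ($h1->attributes as $attr) {
            $value = htmlspecialchars($attr->nodeValue, ENT_QUOTES, 'UTF-8');
            $tag .= ' ' . $attr->nodeName . '="' . $value . '"';
        }
        $tags[] = $tag . '>';
    }
    return $tags;
}

// Example usage with made-up markup.
$sample = '<h1 class="título" id="main">Olá</h1><p>text</p><h1>Second</h1>';
print_r(h1OpeningTags($sample));
```

Because the tag is rebuilt rather than copied, quoting and escaping may differ from the source: for example, a literal `>` inside an attribute value comes back as `&gt;`.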

lisovaccaro
  • Don't use regex for this. http://blog.codinghorror.com/parsing-html-the-cthulhu-way/ – Quentin Sep 18 '14 at 13:47
  • hi Quentin, as you recommended I used DOM. I'll upload the implementation in a moment. – lisovaccaro Sep 18 '14 at 14:16
  • I just edited the question to show what I'm doing, however I'm having a problem with encoding, I'm not sure how the evaluate method works so that may be causing the problem. – lisovaccaro Sep 18 '14 at 14:18
  • @DJDavid98 It won't work well with things like `…`. See what happens if you use regex to parse HTML: http://stackoverflow.com/a/1732454/1529630 – Oriol Sep 18 '14 at 14:20
  • @Oriol Any `<` and `>` that's not part of the HTML markup should be escaped as `&lt;` and `&gt;`. If someone uses problematic code like that along with regex, that's their fault. – SeinopSys Sep 18 '14 at 14:28
  • @DJDavid98 According to the HTML5 spec, [attribute values](http://www.w3.org/TR/html5/syntax.html#syntax-attribute-value) have only one general restriction: the text cannot contain an ambiguous ampersand. Additionally, single-quoted attribute values can't contain `'`, and double-quoted attribute values can't contain `"`. Only [unquoted attribute values](http://www.w3.org/TR/html5/syntax.html#unquoted) have further restrictions; for example, they can't contain `>`. – Oriol Sep 18 '14 at 14:51
  • @lisovaccaro The original question asked for a regex. IMHO, if we change questions drastically, everybody will get confused; this post will be read not only by us but also by devs looking for similar solutions. If you changed your mind and now want to use a parser, don't edit the question with new doubts; start a new thread for that instead. – Marco Medrano Sep 18 '14 at 16:32

1 Answer

Not an expert, but I made this:

<h1( [^>]*(["'].*["'])\1*)?>

Here are my tests: [image]

Update 1:

<h1\s*>|(.*=['"]*[^'"]*['"]*)>

Update 2:

<h1(.+=((['"]+[^'"]*['"]+)|[0-9]+))*\s*>

I modeled the pattern on what an h1 opening tag should support.
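A quick way to sanity-check the Update 2 pattern in PHP is to run it against a handful of sample tags. This is only a sketch: the `~` delimiters, the `H1_PATTERN` constant, the `matchH1` helper, and the sample strings are my own assumptions for illustration, not part of the answer.

```php
<?php
// "Update 2" pattern from above, wrapped in ~ delimiters (delimiter choice is an assumption).
const H1_PATTERN = '~<h1(.+=(([\'"]+[^\'"]*[\'"]+)|[0-9]+))*\s*>~';

// Hypothetical helper: returns the matched text, or null when nothing matches.
function matchH1(string $input): ?string
{
    return preg_match(H1_PATTERN, $input, $m) === 1 ? $m[0] : null;
}

// Made-up samples covering plain, quoted, numeric, and boolean attributes,
// plus a line where another tag follows the h1.
$samples = [
    '<h1>',
    '<h1 class="title">',
    '<h1 id=5>',
    '<h1 hidden>',
    '<h1>Hi</h1><p class="x">',
];

foreach ($samples as $sample) {
    echo $sample, ' => ', var_export(matchH1($sample), true), PHP_EOL;
}
```

By my reading, the greedy `.+` lets a match run past the first `>` when a later tag on the same line carries a quoted attribute, and attributes written without `=` (such as `hidden`) are not matched at all, so the pattern is probably best treated as a rough filter.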

Marco Medrano