In my code I have the following regexp:

 preg_match_all('/<title>([^>]*)<\/title>/si', $contents, $match );

That retrieves the <h>..</h> tags from a webpage. But sometimes the headings contain HTML tags such as <strong>, <b>, etc., so the pattern needs some modification. I therefore tried this one:

preg_match_all('/<h[1-6]>(.*)<\/h[1-6]>/si', $contents, $match );

But something is wrong: it does not retrieve the content inside the HTML <h> tags.

Can you help me correct the regexp?

Andy Lester
Dimitrios Desyllas
  • 7
    [Have you tried using a DOM parser?](http://stackoverflow.com/a/1732454/511529) – GolezTrol May 13 '16 at 21:32
  • 4
    If the `h`s have any attributes this will fail. `.*` is also greedy: if you have more than one heading on the page, it will eat everything between the first opening tag and the last closing tag. A parser is your best approach. Take a look at http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php – chris85 May 13 '16 at 21:34
  • 1
    As it says in that other post, don't use regex to parse HTML unless your HTML is dead simple and you don't need to search for nested tags. Even then, it's a bad idea. There are DOM parsers ([DOMDocument](https://php.net/domdocument)) that are made for parsing HTML and are quite easy to work with. They have several of the same methods available in JS, like `getElementsByTagName`, which could be used to find each `<h1>` to `<h6>` tag. – Jonathan Kuhn May 13 '16 at 21:37
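
To make the DOM-parser suggestion from the comments concrete, here is a minimal sketch using PHP's built-in `DOMDocument` and `DOMXPath`. The sample markup in `$contents` and the `$headings` variable are assumptions for illustration only; in the question, `$contents` would hold the fetched page.

```php
<?php
// Hypothetical sample markup; in the question, $contents holds the fetched page.
$contents = '<html><body><h1 class="top">Main <strong>title</strong></h1>'
          . '<p>Text</p><h2>Second <b>heading</b></h2></body></html>';

// Real pages are rarely valid HTML, so buffer libxml warnings instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($contents);
libxml_clear_errors();

// One XPath union selects every heading level, in document order.
$xpath = new DOMXPath($doc);
$headings = array();
foreach ($xpath->query('//h1|//h2|//h3|//h4|//h5|//h6') as $node) {
    // textContent drops nested tags such as <strong> and keeps their text.
    $headings[] = trim($node->textContent);
}

print_r($headings);
```

With this sample input, `$headings` should contain `Main title` and `Second heading`. Attributes on the heading tags and nested inline tags need no special handling, which is the main reason the commenters prefer a parser here.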

3 Answers

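// Match each opening heading tag without attributes, e.g. <h2>.
preg_match_all('/<h\d>/', $contents, $matches);

$num = array();
// $matches[0] holds the full matches; the level digit is the third character.
foreach ($matches[0] as $match) {
    $num[] = substr($match, 2, 1);
}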
xpeiro

When you use (.*) you match everything. If the headings contain only word characters, digits and whitespace, you can use a character class instead and match one or more of them:

preg_match_all('/<h[1-6]>([\w\d\s]+)<\/h[1-6]>/si', $contents, $match);
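
If the headings can also contain attributes or inline tags, as in the question, one possible regex variation is sketched below. It is only an assumption about what the page looks like, not a general HTML parser: it allows attributes with `[^>]*`, uses the lazy `(.*?)` so each match stops at the nearest closing tag, and uses the backreference `\1` so the closing level equals the opening level. The sample `$contents` and the `$text` variable are made up for illustration.

```php
<?php
// Hypothetical sample input.
$contents = '<h1 class="a">First <strong>one</strong></h1><p>x</p><h2>Second <b>two</b></h2>';

// <h([1-6])   opening tag, capturing the heading level
// \b[^>]*>    optional attributes up to the end of the opening tag
// (.*?)       lazy: stops at the first closing tag that fits
// <\/h\1>     \1 forces the closing level to equal the opening level
preg_match_all('/<h([1-6])\b[^>]*>(.*?)<\/h\1>/si', $contents, $match);

// $match[1] holds the levels, $match[2] the inner HTML of each heading.
$text = array_map('strip_tags', $match[2]);

print_r($match[2]);
print_r($text);
```

Here `$match[2]` keeps the inner markup (`First <strong>one</strong>`), while `$text` has the tags removed (`First one`). A heading nested inside another heading of the same level would still confuse this pattern.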
El_Happy

I'm no regex expert, but in your shoes I'd do it like this:

    <?php

        // SIMULATED SAMPLE HTML CONTENT - WITH ATTRIBUTES:
        $contents = '<section id="id-1">And even when darkness covers your path and no one is there to lend a hand;
            <h3 class="class-1">Always remember that <em>There is always light at the end of the Tunnel <span class="class-2">if you can but hang on to your Faith!</span></em></h3>
            <div>Now; let no one deceive you: <h2 class="class-2">You will be tried in ever ways - sometimes beyond your limits...</h2></div>
            <article>But hang on because You are the Voice... You are the Light and you shall rule your Destiny because it is all about<h6 class="class4">YOU - THE REAL YOU!!!</h6></article>
            </section>';

        // SPLIT THE CONTENT AT EACH CLOSING </h[1-6]> TAG
        $parts      = preg_split("%<\/h[1-6]>%si", $contents);
        $matches    = array();

        // LOOP THROUGH $parts AND BUNDLE APPROPRIATE ELEMENTS TO THE $matches ARRAY.       
        foreach($parts as $part){
            if(preg_match("%(.*|.?)(<h)([1-6])%si", $part)){
                $matches[] = preg_replace("%(.*|.?)(<)(h[1-6])(.*)%si", "$2$3$4$2/$3>", $part);
            }
        }
        var_dump($matches);


        //DUMPS::::
        array (size=3)
          0 => string '<h3 class="class-1">Always remember that <em>There is always light at the end of the Tunnel <span class="class-2">if you can but hang on to your Faith!</span></em></h3>' (length=168)
          1 => string '<h2 class="class-2">You will be tried in ever ways - sometimes beyond your limits...</h2>' (length=89)
          2 => string '<h6 class="class4">YOU - THE REAL YOU!!!</h6>' (length=45)

As a Function, this is what it boils down to:

    <?php

        function pseudoMatchHTags($htmlContentWithHTags){
            $parts      = preg_split("%<\/h[1-6]>%si", $htmlContentWithHTags);
            $matches    = array();
            foreach($parts as $part){
                if(preg_match("%(.*|.?)(<h)([1-6])%si", $part)){
                    $matches[] = preg_replace("%(.*|.?)(<)(h[1-6])(.*)%si", "$2$3$4$2/$3>", $part);
                }
            }
            return $matches;
        }

        var_dump(pseudoMatchHTags($contents));

You can test it here: https://eval.in/571312. Perhaps it helps a bit, I hope. ;-)
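
As a usage sketch: the function above returns each heading with its tags intact. If only the text is wanted, the results can be passed through `strip_tags`. The function is restated here (with a single-quoted replacement string) so the snippet runs on its own; the sample `$html` and the `$plain` variable are assumptions for illustration.

```php
<?php
// Restated from the answer above so this snippet is self-contained.
function pseudoMatchHTags($htmlContentWithHTags){
    $parts   = preg_split("%<\/h[1-6]>%si", $htmlContentWithHTags);
    $matches = array();
    foreach($parts as $part){
        if(preg_match("%(.*|.?)(<h)([1-6])%si", $part)){
            // Drop everything before the last <hN in this part and re-append the closing tag.
            $matches[] = preg_replace("%(.*|.?)(<)(h[1-6])(.*)%si", '$2$3$4$2/$3>', $part);
        }
    }
    return $matches;
}

// Hypothetical sample input.
$html  = '<div><h2 class="x">Hello <b>World</b></h2></div>';
$found = pseudoMatchHTags($html);

// Remove the heading tag itself and any nested inline tags.
$plain = array_map('strip_tags', $found);

print_r($found);
print_r($plain);
```

For this input, `$found` holds the full `<h2 ...>...</h2>` element and `$plain` holds `Hello World`.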

Poiz