How to create a pattern for preg_match_all

Question

I tried googling this but couldn't find anything clear about it. First, I was hoping someone could help me write a pattern to get the content between these tags:

<vboxview leftinset="10" rightinset="0" stretchiness="1">    // CONTENT INSIDE HERE </vboxview>

Second, could you also please explain each part of the pattern in detail: what it does, and how you specify which part of the input to capture?

Use an XML parser. Regexes are not designed for parsing XML or HTML. — Cfreak, Nov 08 '11 at 19:21
OK, but then what is preg_match_all used for? On php.net they actually show examples where they parse HTML. — Ahoura Ghotbi, Nov 08 '11 at 19:23
@AhouraGhotbi - Yes it's a bad example and they should change it. Regular expressions are for parsing data that has patterns in it. XML and HTML by definition are unstructured. You can use a regex to parse them but it's not a good idea because there's no requirement that the file be structured a certain way. In other words you run a high risk that your program will break even if someone gave you an XML file that matches your spec. — Cfreak, Nov 08 '11 at 19:26
@Cfreak Thanks for the useful info :) ... well, I didn't know that. — Ahoura Ghotbi, Nov 08 '11 at 19:32
Well, html does have patterns in it. It's just that they're not RELIABLE patterns. — Marc B, Nov 08 '11 at 20:05
You might want to have a look at PHP's XSLT functions, which are made specifically for transforming XML documents. http://www.php.net/manual/en/book.xslt.php — Jan Thomä, Nov 08 '11 at 20:05
"Regular expressions" are for parsing text with "regular patterns". Now that that is done with, many supposed "regular expression" implementations (.NET for sure, and I believe Perl, Python, PHP, and several other well-known implementations) support syntax for matching _irregular_ patterns (particularly nested, synonymous HTML/XML tags, etc.). The only potential problem with what you appear to be doing is if there can be tags nested inside the `vboxview` element, and even more so if there can be more `vboxview` elements nested inside as well. If not, there's absolutely no problem. — Code Jockey, Nov 08 '11 at 20:08
@Cfreak: I'm seeing nightmares about "by definition unstructured". That's incorrect on so many levels. The problem is that the structure is more complex than regular expressions can cope with; regular expressions can cope with regular languages, but XML and similar languages need at least a context-free grammar (basically, you can't handle arbitrarily nested tags with regular expressions). — tripleee, Nov 09 '11 at 09:59

score 1 · Accepted Answer · edited May 23 '17 at 12:21

See my comment on the question for my rant on SGML-based languages and regex...

Now to my answer.

If you know there will not be any other HTML/XML elements inside the tag in question, then this will work quite well:

<vboxview\s+(?P<vboxviewAttributes>(\\>|[^>])*)>(?P<vboxviewContent>(\\<|[^<])*)</vboxview>

Broken down, this expression says:

<vboxview                  # match `<vboxview` literally
\s+                        # match at least one whitespace character
(?P<vboxviewAttributes>    # begin capture (into a group named "vboxviewAttributes")
   (\\>|[^>])*             #    any number of (either `\>` or NOT `>`)
)                          # end capture
>                          # match a `>` character
(?P<vboxviewContent>       # begin capture (into a group named "vboxviewContent")
   (\\<|[^<])*             #    any number of (either `\<` or NOT `<`)
)                          # end capture
</vboxview>                # match `</vboxview>` literally

You will need to escape `<` and `>` characters inside the source as `\<` and `\>` or, even better, as HTML/XML entities (`&lt;` and `&gt;`).
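Since the question asks about preg_match_all specifically, here is a minimal sketch of how the pattern could be used from PHP. The sample input and the variable names are made up for illustration. Note that inside a single-quoted PHP string each regex `\\` has to be written as `\\\\`.

```php
<?php
// Hypothetical sample input: two vboxview elements with plain-text content.
$source = '<vboxview leftinset="10" rightinset="0" stretchiness="1">first</vboxview>'
        . '<vboxview leftinset="5">second</vboxview>';

// The pattern from above. "#" is used as the delimiter so that the "/"
// in the closing tag does not need escaping.
$pattern = '#<vboxview\s+(?P<vboxviewAttributes>(\\\\>|[^>])*)>'
         . '(?P<vboxviewContent>(\\\\<|[^<])*)</vboxview>#';

// PREG_SET_ORDER groups the results per match rather than per capture group,
// so each $match holds the named groups of one vboxview element.
$count = preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    echo $match['vboxviewAttributes'], ' => ', $match['vboxviewContent'], "\n";
}
```

Each entry in `$matches` can be read by group name, which is what the `(?P<name>...)` syntax in the pattern is for.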

If there are going to be nested constructs inside, then you are either going to start running into problems with regex, or you will have already decided to use another method that does not involve regex - either way is sufficient!
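To illustrate the nesting limitation, the following sketch runs the same pattern against one flat input and one input with a nested element. The `<label>` tag and both input strings are hypothetical.

```php
<?php
// Same pattern as above, written as a single-quoted PHP string.
$pattern = '#<vboxview\s+(?P<vboxviewAttributes>(\\\\>|[^>])*)>'
         . '(?P<vboxviewContent>(\\\\<|[^<])*)</vboxview>#';

// Hypothetical inputs: one with plain text only, one with a nested tag.
$flat   = '<vboxview leftinset="10">plain text</vboxview>';
$nested = '<vboxview leftinset="10">text <label>inner</label></vboxview>';

$flatCount   = preg_match_all($pattern, $flat, $flatMatches);
$nestedCount = preg_match_all($pattern, $nested, $nestedMatches);

// The content group only accepts characters other than "<" (or an escaped
// "\<"), so it stops at the nested "<label>" and the closing tag never matches.
echo "flat: $flatCount match(es)\n";
echo "nested: $nestedCount match(es)\n";
```

The nested input produces no match at all, which is the kind of silent failure the comments on the question warn about.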

Thanks a lot :) ... I am looking into other methods, but for now this seems good. — Ahoura Ghotbi, Nov 08 '11 at 20:32

score 1 · Answer 2 · answered Nov 09 '11 at 08:38

As mentioned in the comments, it is usually not a good idea to extract things from HTML with regular expressions. If you ever want to switch to a more robust method, here's a quick example of how you could extract the information using the DOMDocument API.

<?php
function get_vboxview($html) {

    $output = array();

    // Create a new DOM object
    $doc = new DOMDocument;

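    // load a string in as html; unknown tags such as vboxview make libxml
    // report parse errors, so collect them instead of emitting warnings
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);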

    // create a new Xpath object to query the document with
    $xpath = new DOMXPath($doc);

    // an xpath query that looks for a vboxview node anywhere in the DOM
    // with an attribute named leftinset set to 10, an attribute named rightinset
    // set to 0 and an attribute named stretchiness set to 1
    $query = '//vboxview[@leftinset=10 and @rightinset=0 and @stretchiness=1]';

    // query the document
    $matches = $xpath->query($query);

    // loop through each matching node
    // and add its textContent to the output
    foreach ($matches as $m) {
        $output[] = $m->textContent;
    }

    return $output;
}
?>
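As a rough sketch of why the DOM approach copes with nested content, the snippet below runs a simpler query (`//vboxview`, without the attribute conditions) against an input that contains a nested element, and also reads one attribute. The input string, the `<label>` tag and the shape of the `$results` array are assumptions made for illustration.

```php
<?php
// Hypothetical input: one vboxview with a nested element inside it.
$html = '<vboxview leftinset="10" rightinset="0" stretchiness="1">'
      . 'text <label>inner</label></vboxview>';

// vboxview is not a known HTML tag, so libxml reports parse errors;
// collect them internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Collect one attribute and the text content of every vboxview node.
$results = array();
foreach ($xpath->query('//vboxview') as $node) {
    $results[] = array(
        'leftinset' => $node->getAttribute('leftinset'),
        'text'      => $node->textContent,
    );
}

print_r($results);
```

Here `textContent` includes the text of nested elements, so the nested tag does not break the extraction the way it does with the regular expression.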

Better yet, if there is guaranteed to be only one vboxview in your input (and assuming you have control of the HTML), you could add an id attribute to the vboxview and cut the code down to a shorter, more general function.

<?php
function get_node_text($html, $id) {
    // Create a new DOM object
    $doc = new DOMDocument;

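    // load a string in as html, collecting libxml parse errors
    // (e.g. for the unknown vboxview tag) instead of emitting warnings
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);

    // return the textContent of the node with the id $id,
    // or null if no such node exists
    $node = $doc->getElementById($id);
    return $node ? $node->textContent : null;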
}
?>
