regex to select between and while ignoring all text inside any <>

Question

I have the following two types of text:

Type one:

<div class="meta-name">Corporate Officers</div>
<div class="meta-data"><table border="0" cellspacing="0" cellpadding="0" width="171">
<col width="171"></col>
<tbody>
<tr height="19">
<td width="171" height="19">Officer One</td>
</tr>
</tbody>
</table> 
</div>
</div>

Type two:

<div class="meta-name">Corporate Officers</div>
<div class="meta-data">Officer Two</div>
</div>

I'm using PHP with preg_match_all. I need a single expression that will return Officer One and Officer Two from the above. I'm using Corporate Officers</div> as the first anchor and </div> as the second, but I can't find Officer One inside all that table gibberish.

How do I return text between anchor1 and anchor2 while ignoring all text inside any brackets <> between?

I saw these threads but wasn't able to make their solutions work for me: "RegEx: extract everything until X where X is not between two braces" and "everything, but everything between [ and ]".

Have you considered using an HTML parser? http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Mark Byers, Nov 19 '11 at 21:58
*(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) — Gordon, Nov 19 '11 at 22:18

Bailey Parker · Answer 1 · 2011-11-19T22:41:48.600

With SimpleXML:

$xml = new SimpleXMLElement('<div>
    <div class="meta-name">
        Corporate Officers
    </div>
    <div class="meta-data">
        <table border="0" cellspacing="0" cellpadding="0" width="171">
            <col width="171" />
            <tbody>
                <tr height="19">
                    <td width="171" height="19">
                        Officer One
                    </td>
                </tr>
            </tbody>
        </table>
    </div>
</div>
');

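// Collect the text of every direct child <div> whose class list
// contains "meta-name" or "meta-data".
$results = array();
foreach($xml->children() as $node) {
    if($node->getName() == 'div') {
        $attributes = $node->attributes();
        // Attribute values are SimpleXMLElement objects, so cast before splitting.
        $classes = explode(' ', (string) $attributes['class']);
        if(in_array('meta-name', $classes) || in_array('meta-data', $classes)) {
            $results[] = getText($node);
        }
    }
}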

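// Return the first non-empty text found in $node itself or, searching
// depth-first, in its descendants; null if there is none.
function getText($node) {
    $text = trim((string) $node);
    if(strlen($text) !== 0) {
        return $text;
    }

    foreach($node->children() as $child) {
        $text = getText($child);
        // Compare against null so that a text value like "0" is not skipped.
        if($text !== null) {
            return $text;
        }
    }

    return null;
}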

var_dump($results);

As a general rule of thumb, never use Regex to parse HTML.

FailedDev · Answer 2 · 2011-11-19T22:20:31.920

About 80% of regex questions are about xml/html/xhtml. And about 75% of the answers are to not use a regex. Why? Because while it may seem to work for your example, it will be fragile and may break with a slight change of the input.

Please take a look at this beautiful tool. If you can't use it, then come back and we will provide you with help.
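
The tool linked here is not preserved in this copy of the thread. As an illustration of the parser approach in general, here is a minimal sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath), which is a swapped-in choice and not necessarily what was linked. The wrapper `<div>` around each sample and the variable names are assumptions made for the example.

```php
<?php
// Both sample shapes from the question, each wrapped in a container <div>
// (the wrapper is an assumption; the question shows only fragments).
$html = <<<'HTML'
<div>
<div class="meta-name">Corporate Officers</div>
<div class="meta-data"><table border="0" cellspacing="0" cellpadding="0" width="171">
<col width="171"></col>
<tbody><tr height="19"><td width="171" height="19">Officer One</td></tr></tbody>
</table>
</div>
</div>
<div>
<div class="meta-name">Corporate Officers</div>
<div class="meta-data">Officer Two</div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

// For each "Corporate Officers" label, take the first meta-data <div> after it.
$xpath = new DOMXPath($doc);
$query = '//div[@class="meta-name"][normalize-space()="Corporate Officers"]'
       . '/following-sibling::div[@class="meta-data"][1]';

$officers = [];
foreach ($xpath->query($query) as $node) {
    // textContent concatenates all descendant text, so the table markup drops out.
    $officers[] = trim($node->textContent);
}

print_r($officers);
```

Because the text is read from the parsed tree, it does not matter how many tags sit between the label and the name, which is the part the regex has to work hardest at.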

edited Nov 19 '11 at 22:20

answered Nov 19 '11 at 22:04


score -1 · Answer 3 · answered Nov 20 '11 at 11:00

Try this regex:

'~<div\b[^>]*>Corporate\s+Officers</div>\s*<div\b[^>]*>(?:<(?!/?div\b)[^>]*>|\s+)*\K[^<]+~'

This is based on the assumption that there's no other text content in the HTML between the opening <div> tags and the names you're looking for. The first part is self-explanatory:

<div\b[^>]*>Corporate\s+Officers</div>\s*<div\b[^>]*>

I'm assuming the "Corporate Officers" text is sufficient to locate the starting point, but you can reinsert the class attributes if necessary. After that,

(?:<(?!/?div\b)[^>]*>|\s+)*

...consumes any number of tags other than <div> or </div> tags, along with any intervening whitespace. Then \K comes along and says forget all that, the real match starts here. [^<]+ consumes everything up to the beginning of the next tag, and that's all you see in the match results. It's as if everything before the \K was really a positive lookbehind, but without all the restrictions.

Here's a demo.
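
For reference, a minimal sketch of how the pattern might be run with preg_match_all against the two samples from the question. The variable names and the concatenation of both samples into one string are assumptions made for the example.

```php
<?php
// Sample inputs copied from the question (type one and type two).
$typeOne = <<<'HTML'
<div class="meta-name">Corporate Officers</div>
<div class="meta-data"><table border="0" cellspacing="0" cellpadding="0" width="171">
<col width="171"></col>
<tbody>
<tr height="19">
<td width="171" height="19">Officer One</td>
</tr>
</tbody>
</table>
</div>
</div>
HTML;

$typeTwo = <<<'HTML'
<div class="meta-name">Corporate Officers</div>
<div class="meta-data">Officer Two</div>
</div>
HTML;

$pattern = '~<div\b[^>]*>Corporate\s+Officers</div>\s*<div\b[^>]*>(?:<(?!/?div\b)[^>]*>|\s+)*\K[^<]+~';

// Because of \K, the overall match ($matches[0]) holds only the text
// that follows the skipped tags; no capture group is needed.
preg_match_all($pattern, $typeOne . "\n" . $typeTwo, $matches);
$names = array_map('trim', $matches[0]);

print_r($names); // Officer One, Officer Two
```

Note that the pattern still depends on the name being the first text after the second opening `<div>`; any stray text before it would be returned instead.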

@Tom, I merely answered the question that was asked. If you downvote everyone who does this, you'll run out of rep in no time. Or do you plan to downvote only the answers that actually work? ;) — Alan Moore, Nov 20 '11 at 15:50
That's why I plan on answering more and more questions for more rep. To downvote ALL the other answers ! — Tom, Nov 20 '11 at 16:41
