HTML-parsing regular expression

Question

I want to parse an HTML document and get all users' nicknames.

They are in this format:

<a href="/nickname_u_2412477356587950963">Nickname</a>

How can I do it using a regular expression in PHP? I can't use DOMElement or simple HTML parsing.

Oblig: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Oded, Aug 14 '11 at 16:43
You don't need a regex, you can do it with [DomDocument::loadHTML()](http://www.php.net/manual/en/domdocument.loadhtml.php). — Arnaud Le Blanc, Aug 14 '11 at 16:46

score 3 · Accepted Answer · edited Oct 22 '13 at 17:23

Here is a working solution without using a regular expression:

DomDocument::loadHTML() is forgiving enough to work on malformed HTML.

<?php
    $doc = new DomDocument;
    $doc->loadHTML('<a href="/nickname_u_2412477356587950963">Nickname</a>');

    $xpath = new DomXPath($doc);
    $nodes = $xpath->query('//a[starts-with(@href, "/nickname")]');

    foreach($nodes as $node) {
        $username = $node->textContent;
        $href = $node->getAttribute('href');
        printf("%s => %s\n", $username, $href);
    }
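
To illustrate the tolerance for malformed HTML, here is a minimal sketch. The sample markup (an unclosed <p> and a stray </div>) is invented for illustration, and libxml_use_internal_errors(true) is used so the parser's complaints are collected internally instead of being printed as PHP warnings:

```php
<?php
// Hypothetical malformed input: unclosed <p>, stray </div>.
$html = '<p>Users: <a href="/nickname_u_2412477356587950963">Nickname</a></div>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Discard the collected errors and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
$names = [];
foreach ($xpath->query('//a[starts-with(@href, "/nickname")]') as $node) {
    $names[] = $node->textContent;
}

print_r($names);
```

If the warnings are useful for debugging, libxml_get_errors() can be called before libxml_clear_errors() to inspect them.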

edited Oct 22 '13 at 17:23 by Peter Mortensen
answered Aug 14 '11 at 16:56 by Arnaud Le Blanc

Thanks, but you guys didn't understand what I meant: "nickname" is a variable, and so is the random set of numbers. – André Cardoso, Aug 14 '11 at 19:24
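
Following up on the comment above: if the part before _u_ differs per user, the XPath filter can match on the _u_ marker rather than on a fixed prefix. This is only a sketch, assuming every user link has the form /<name>_u_<digits>; the sample names "alice" and "bob" and the /about link are made up for illustration:

```php
<?php
// Invented sample markup; the assumed href shape is /<name>_u_<digits>.
$html = '<a href="/alice_u_111">Alice</a> <a href="/about">About</a> <a href="/bob_u_222">Bob</a>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$users = [];
// XPath 1.0 has no regex support, so pre-filter with contains() ...
foreach ($xpath->query('//a[contains(@href, "_u_")]') as $node) {
    $href = $node->getAttribute('href');
    // ... then confirm the exact shape of the href in PHP.
    if (preg_match('~^/[^/]+_u_\d+$~', $href)) {
        $users[$href] = $node->textContent;
    }
}

print_r($users);
```

The small regex here only validates an attribute value that the parser has already extracted, so it does not have to cope with HTML syntax.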

score 3 · Answer 2 · edited May 23 '17 at 12:26

preg_match_all(
    '{                  # match when
        nickname_u_     # there is nickname_u_
        \d+             # followed by one or more digits
        ">              # followed by a quote and closing bracket
        (.*?)           # lazily capture anything that follows
        </a>            # up to the first </a> sequence
    }xm',
    '<a href="/nickname_u_2412477356587950963">Nickname</a>',
    $matches
);
print_r($matches);

The usual disclaimers about using a regex on HTML instead of an HTML parser apply. The pattern above can probably be made more robust, but it works for the example you gave.
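
Since the asker notes that both the name in the URL and the number vary, a variant that captures them could look like the sketch below. It assumes every user link is written exactly as <a href="/<name>_u_<digits>">text</a>, with double quotes and no other attributes; the sample input is invented:

```php
<?php
// Invented sample input; assumes links of the exact form
// <a href="/<name>_u_<digits>">Display name</a>.
$html = '<a href="/alice_u_111">Alice</a> <a href="/about">About</a> <a href="/bob_u_222">Bob</a>';

preg_match_all(
    '~<a\s+href="/([^"/]+)_u_(\d+)">(.*?)</a>~s',
    $html,
    $matches,
    PREG_SET_ORDER
);

$nicknames = [];
foreach ($matches as $m) {
    // $m[1] = name part of the URL, $m[2] = numeric id, $m[3] = link text
    $nicknames[] = $m[3];
}

print_r($nicknames);
```

PREG_SET_ORDER groups the captures per link, which makes it easier to keep each nickname together with its URL parts.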
