I want to parse an HTML document and get all users' nicknames.

They are in this format:

<a href="/nickname_u_2412477356587950963">Nickname</a>

How can I do it using a regular expression in PHP? I can't use DOMElement or simple HTML parsing.

Peter Mortensen
  • Oblig: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Oded Aug 14 '11 at 16:43
  • Out of pure curiosity, why can't you use an HTML parser? – Aug 14 '11 at 16:44
  • You don't need a regex, you can do it with [DomDocument::loadHTML()](http://www.php.net/manual/en/domdocument.loadhtml.php). – Arnaud Le Blanc Aug 14 '11 at 16:46

2 Answers

Here is a working solution without using a regular expression:

DomDocument::loadHTML() is forgiving enough to work on malformed HTML.

<?php
    $doc = new DomDocument;
    $doc->loadHTML('<a href="/nickname_u_2412477356587950963">Nickname</a>');

    $xpath = new DomXPath($doc);
    $nodes = $xpath->query('//a[starts-with(@href, "/nickname")]');

    foreach($nodes as $node) {
        $username = $node->textContent;
        $href = $node->getAttribute('href');
        printf("%s => %s\n", $username, $href);
    }
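
To extend the snippet above to a whole page, the following is a minimal sketch, assuming a hypothetical and deliberately malformed input string. Only the link format comes from the question; the sample markup and names are made up for illustration. `libxml_use_internal_errors(true)` keeps the parser warnings for bad markup from being printed.

```php
<?php
// Hypothetical, deliberately malformed input (unclosed <p>, stray </div>).
$html = '<p><a href="/nickname_u_1">Alice</a></div>'
      . '<a href="/nickname_u_2">Bob</a><a href="/help">Help</a>';

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Keep only links whose href starts with the nickname prefix.
$nicknames = [];
foreach ($xpath->query('//a[starts-with(@href, "/nickname_u_")]') as $node) {
    $nicknames[] = $node->textContent;
}

print_r($nicknames);
```

The `/help` link is skipped because its href does not start with `/nickname_u_`.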
Peter Mortensen
Arnaud Le Blanc
preg_match_all(
    '{                  # match when
        nickname_u_     # there is nickname_u
        \d+             # followed by one or more digits
        ">              # followed by quote and closing bracket
        (.*?)           # lazily capture anything that follows
        </a>            # until the first </a> sequence
    }xm',
    '<a href="/nickname_u_2412477356587950963">Nickname</a>',
    $matches
);
print_r($matches);

The usual disclaimers about using a regex on HTML instead of an HTML parser apply. The pattern above can probably be made more robust, but it works for the example you gave.
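
For a document with several links, the captured nicknames can be collected from the match array. This is only a sketch under assumptions: the sample HTML is hypothetical, and the pattern is a tightened variant (explicit `<a href=`, `\d+`, lazy capture) rather than the exact one above.

```php
<?php
// Hypothetical sample input; only the link format comes from the question.
$html = '<p><a href="/nickname_u_111">Alice</a> and '
      . '<a href="/nickname_u_222">Bob</a>, plus <a href="/about">About</a></p>';

// Group 1: numeric id, group 2: link text (lazy, so it stops at the first </a>).
$pattern = '{<a\s+href="/nickname_u_(\d+)">(.*?)</a>}s';

// PREG_SET_ORDER yields one array per matched link.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$nicknames = [];
foreach ($matches as $match) {
    // Decode entities such as &amp; that may appear in the link text.
    $nicknames[] = html_entity_decode($match[2], ENT_QUOTES, 'UTF-8');
}

print_r($nicknames);
```

This still breaks if the attribute order, quoting, or whitespace inside the tag differs from the example, which is the main reason a parser is preferable when one is available.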

Community
Gordon