
I already have a function that retrieves the href attribute from all of the a tags on a given page of markup. However, I would also like to retrieve other attributes, namely the title attribute.

I have a feeling it's a simple modification of the regular expression that I'm already using, but my only concern is the order of appearance in the markup. If I have a link with this code:

<a href="somepage.html" title="My Page">link text</a>

I want it to be parsed the same and not cause any errors even if it appears like this:

<a title="My Page" href="somepage.html">link text</a>

Here is my processing function:

function getLinks($src) {
    if(preg_match_all('/<a\s+href=["\']([^"\']+)["\']/i', $src, $links, PREG_PATTERN_ORDER))
        return array_unique($links[1]);
    return false;
}

Would I have to use another regex altogether, or would it be possible to modify this one so that the title attribute is stored in the same array of returned data as the href attribute?

SISYN
  • Don't use regex to parse HTML, use an HTML parser instead. – Ibrahim Najjar Sep 01 '13 at 23:01
  • Just as a tip, you should probably be using a proper HTML parser instead of regular expressions. – Adrian Wragg Sep 01 '13 at 23:01
  • Any suggestions on HTML parsers? I tend to do things the hard way (unknowingly) so I'm not familiar with any. – SISYN Sep 01 '13 at 23:03
  • @danl Take a look at the suggestions in http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php. – Adrian Wragg Sep 01 '13 at 23:10
  • Does this answer your question? [How do you parse and process HTML/XML in PHP?](https://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) – Nico Haase Jun 01 '20 at 17:49
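
For reference, the parser route suggested in the comments could look roughly like the sketch below. It assumes PHP's bundled DOM extension is enabled; `getLinksWithDom` is a hypothetical name, not something from the question. Attribute order stops mattering because the parser exposes attributes by name.

```php
<?php
// Hypothetical helper (not from the question): collects href and title
// from every <a> element using the DOM extension instead of a regex.
function getLinksWithDom($src) {
    // loadHTML() rejects an empty string, so bail out early.
    if (trim($src) === '') {
        return array();
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them;
    // real-world markup is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($src);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        if (!$a->hasAttribute('href')) {
            continue; // skip anchors without a target
        }
        $links[] = array(
            'href'  => $a->getAttribute('href'),
            'title' => $a->hasAttribute('title') ? $a->getAttribute('title') : null,
        );
    }
    return $links;
}
```

Each entry keeps the href and title of one link together, with `title` set to `null` when the attribute is absent.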

2 Answers


You can build on that regex. Have a look:

'/<a(?:\s+(?:href=["\'](?P<href>[^"\'<>]+)["\']|title=["\'](?P<title>[^"\'<>]+)["\']|\w+=["\'][^"\'<>]+["\']))+/i'

...or in human-readable form:

preg_match_all(
    '/<a
    (?:\s+
      (?:
         href=["\'](?P<href>[^"\'<>]+)["\']
        |
         title=["\'](?P<title>[^"\'<>]+)["\']
        |
         \w+=["\'][^"\'<>]+["\']
      )
    )+/ix', 
    $subject, $result, PREG_PATTERN_ORDER);

Pretty self-explanatory, I think. Note that your original regex has the same problem with order of appearance: it requires href to be the first attribute. For example, it would fail to match this tag:

<a class="someclass" href="somepage.html">link text</a>

Unless you're absolutely sure there will be no other attributes, you can't reasonably expect href to be listed first. You can use the same gimmick as above: a final catch-all branch silently consumes and discards the attributes that don't interest you:

preg_match_all(
    '/<a
    (?:\s+
      (?:
         href=["\'](?P<href>[^"\'<>]+)["\']
        |
         \w+=["\'][^"\'<>]+["\']
      )
    )+/ix',
    $subject, $result, PREG_PATTERN_ORDER);
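
To tie this back to the question's getLinks(), here is a rough sketch of how the href/title pattern might be dropped into a function. The name `getLinksWithTitles` and the returned array shape are my own assumptions. It uses PREG_SET_ORDER rather than PREG_PATTERN_ORDER so that each link's href and title stay together in one entry.

```php
<?php
// Hypothetical helper: returns one array('href' => ..., 'title' => ...)
// entry per <a> tag, regardless of attribute order.
function getLinksWithTitles($src) {
    $pattern = '/<a
        (?:\s+
          (?:
             href=["\'](?P<href>[^"\'<>]+)["\']
            |
             title=["\'](?P<title>[^"\'<>]+)["\']
            |
             \w+=["\'][^"\'<>]+["\']
          )
        )+/ix';

    if (!preg_match_all($pattern, $src, $matches, PREG_SET_ORDER)) {
        return false;
    }

    $links = array();
    foreach ($matches as $m) {
        // A group that did not take part in the match is either missing
        // or an empty string; normalize both cases to null.
        $href  = (isset($m['href'])  && $m['href']  !== '') ? $m['href']  : null;
        $title = (isset($m['title']) && $m['title'] !== '') ? $m['title'] : null;
        if ($href !== null) {
            $links[] = array('href' => $href, 'title' => $title);
        }
    }
    return $links;
}
```

Keep in mind the catch-all branch only knows attributes of the form name="value" with a \w+ name, so a tag with something like a hyphenated attribute name before href will stop the match early. That limitation is inherited from the pattern itself.
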
Alan Moore

Try this regextrainer I made a while back.

The sample contains a pattern like this: <([^ ]+) ?([^>]*)>([^<]*)< ?/ ?\1> which will capture the attribute text of an HTML tag.

I see now that it doesn't extract the attribute name and value, just the whole attribute text itself. Use this to extract the attribute details: ((([^=]+)=((?:"|'))([^"']+)\4) ?)+
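
One caveat with the repeated outer group: when a capturing group repeats, only its last iteration is kept, so matching the whole pattern once gives you just the final attribute. A possible workaround, sketched below, is to run the inner name=value part through preg_match_all instead. `parseAttributes` is a hypothetical name, and I've tightened `[^=]+` to `[^=\s]+` so the name doesn't swallow leading spaces; that is my variation, not the original pattern.

```php
<?php
// Hypothetical helper: splits attribute text such as
// 'title="My Page" href="somepage.html"' into a name => value array.
function parseAttributes($attrText) {
    $attrs = array();
    // Group 1: name, group 2: opening quote, group 3: value;
    // \2 requires the closing quote to match the opening one.
    $pattern = '/([^=\s]+)=(["\'])([^"\']+)\2/';
    if (preg_match_all($pattern, $attrText, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // Lowercase the name so HREF and href land under the same key.
            $attrs[strtolower($m[1])] = $m[3];
        }
    }
    return $attrs;
}
```

You would feed it the second capture group of the tag pattern above, i.e. everything between the tag name and the closing >.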

Jo Are By