Having a bit of regex headaches with varied links and href delimiters (" and ')

Question

So, I want to match the following link structures with preg_match_all in PHP:

<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>

I can get the " and ' delimited URLs with:

'#<a[^>]*?href=("|\')(.*?)("|\')#is'

or I can get all 3, but not if there are spaces in the first two with:

'#<a[^>]*?href=("|\')?(.*?)[\s\"\'>]#is'

How can I formulate this so that it picks up " and ' delimited URLs that may contain spaces, but also properly encoded URLs without delimiters?

[The <center> cannot hold it is too late.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) - aka don't parse HTML with regex, it's not possible... – ircmaxell Nov 06 '10 at 02:54
BTW, I recommend using this syntax: [ab] instead of: (a|b) because it's more common (easier for most of us to read), shorter, and probably faster. – JasonWoof Nov 09 '10 at 10:13

JasonWoof · Accepted Answer · 2010-11-06T02:58:48.247

1

OK, this seems to work:

'#<a[^>]*?href=((["\'][^\'"]+["\'])|([^"\'\s>]+))#is'

($matches[1] contains the urls)

Only annoyance is that quoted urls have the quotes still on, so you'll have to strip them off:

// $match is one entry of $matches[1]
$first = substr($match, 0, 1);
if ($first === '"' || $first === "'") {
    $match = substr($match, 1, -1);
}
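
Putting the pattern and the quote-stripping step together, a minimal sketch might look like the following. It assumes the four sample links from the question as input; the variable names `$html`, `$pattern`, and `$urls` are illustrative and not part of the original answer.

```php
<?php
// Sample links copied from the question.
$html = <<<'HTML'
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
HTML;

// Pattern from the answer: quoted value OR unquoted value, both in group 1.
$pattern = '#<a[^>]*?href=((["\'][^\'"]+["\'])|([^"\'\s>]+))#is';
preg_match_all($pattern, $html, $matches);

// Strip the surrounding quotes that group 1 still carries.
$urls = array();
foreach ($matches[1] as $match) {
    $first = substr($match, 0, 1);
    if ($first === '"' || $first === "'") {
        $match = substr($match, 1, -1);
    }
    $urls[] = $match;
}

print_r($urls);
```

Note that the quoted branch accepts any mix of opening and closing quote and stops at the first quote of either kind, so a URL that itself contains a quote character would be cut short.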

edited Nov 06 '10 at 02:58

answered Nov 06 '10 at 02:50

JasonWoof

Great, this is perfect for this application. I can rehash the results afterwards to trim off any quotes. I just wanted to avoid running two preg_match_all's to get the links with and without delimiters, so this is an acceptable solution! As for the extra quotes, preg_replace('#[\'"]#', '', $subject) strips them all in one pass. – tweak2 Nov 06 '10 at 11:19
trim($subject, "\"'") does the trick for sanitizing after too, as Alan pointed out. It is likely less resource intensive. – tweak2 Nov 06 '10 at 11:26

Dan Horrigan · Answer 2 · 2010-11-06T03:01:20.717

EDIT: I have edited this to work a little better than I originally posted.

You almost have it in the second regex:

'#<a[^>]*?href=("|\')?(.*?)[\\1|>]#is'

Returns the following array:

array(3) {
  [0]=>
  array(4) {
    [0]=>
    string(92) "<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>"
    [1]=>
    string(101) "<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>"
    [2]=>
    string(94) "<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>"
    [3]=>
    string(77) "<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>"
  }
  [1]=>
  array(4) {
    [0]=>
    string(1) """
    [1]=>
    string(1) "'"
    [2]=>
    string(0) ""
    [3]=>
    string(0) ""
  }
  [2]=>
  array(4) {
    [0]=>
    string(74) "http://this.is.a.link.com/?query=this has invalid spaces" possible garbage"
    [1]=>
    string(83) "http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage"
    [2]=>
    string(77) "http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage"
    [3]=>
    string(60) "http://this.is.a.link.com/?query=no_spaces_but_no_delimiters"
  }
}

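Matches links with or without delimiters, but note from the dump that group 2 runs on to the closing > instead of stopping at the closing quote.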

That last bit should be `(?:\1|>)`, not `[\\1|>]`. Backreferences don't work in character classes, and the OR operator isn't needed. That actually matches one of: backslash, `1`, `|`, or `>`. On the other hand, `("|\')`, while not incorrect, would be much more efficient if you used a character class instead: `(["\'])` — Alan Moore, Nov 06 '10 at 08:05
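
To make the difference between the two endings concrete, here is a small sketch comparing the pattern as posted with the `(?:\1|>)` form from the comment. It uses the first sample link from the question; the names `$asPosted` and `$suggested` are illustrative only.

```php
<?php
// First sample link from the question.
$html = '<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>';

// As posted: [\1|>] is a character class, so the lazy group only stops at ">".
preg_match('#<a[^>]*?href=(["\'])?(.*?)[\\1|>]#is', $html, $asPosted);

// As suggested: (?:\1|>) is a real alternation between the backreference and ">".
preg_match('#<a[^>]*?href=(["\'])?(.*?)(?:\\1|>)#is', $html, $suggested);

echo $asPosted[2], "\n";   // URL plus the trailing quote and attributes
echo $suggested[2], "\n";  // URL only
```

Even with the corrected ending, an unquoted href followed by other attributes would still be captured up to the closing `>`, because there is no opening quote for `\1` to refer back to.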

score 1 · Answer 3 · answered Nov 06 '10 at 07:34

Use a DOM parser. You cannot parse (x)HTML with regular expressions.

$html = <<<END
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
END;

$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($html);
libxml_use_internal_errors(false);

$items = $domd->getElementsByTagName("a");
foreach ($items as $item) {
  var_dump($item->getAttribute("href"));
}

answered Nov 06 '10 at 07:34

Maerlyn

I would think a dom parser itself uses some form of regex internally and would involve some unnecessary overhead. In my experience regex is incredibly fast. If I were doing any form of page data harvesting beyond simple links, I would be using a dom parser and specifying groups. – tweak2 Nov 06 '10 at 11:16
@tweak2: It does not use some form of regex. XML/HTML is not a regular language, so it's not possible to use a regex. This is the best and most robust solution by far. +1 – ircmaxell Nov 06 '10 at 12:00
What is it with all you people and "not a regular language" crap? Patterns haven’t been regular since they got backreferences, let alone a heck of a lot of other stuff like recursion. Your theoretical answers are completely irrelevant to the domain of what modern patterns can parse. – tchrist Nov 06 '10 at 12:31

score 0 · Answer 4 · answered Nov 06 '10 at 02:56

When you say you want to match them, are you trying to extract information out of the links, or simply find hyperlinks with a href? If you're after only the latter, this should work just fine:

/<a[^>]*href=[^\s].*?>/

answered Nov 06 '10 at 02:56

Chris

This cuts off the links that are " or ' delimited and have spaces in them. – tweak2 Nov 06 '10 at 11:16

score 0 · Answer 5 · answered Nov 06 '10 at 09:45

As @JasonWoof indicated, you need to use an embedded alternation: one alternative for quoted URLs, one for non-quoted. I also recommend using a capturing group to determine which kind of quote is being used, as @DanHorrigan did. With the addition of a negative lookahead ((?!\\2)) and possessive quantifiers (*+), you can create a highly robust regex that is also very quick:

~
<a\\s+[^>]*?\\bhref=
(
  (["'])          # capture the opening quote
  (?:(?!\\2).)*+  # anything else, zero or more times
  \\2             # match the closing quote
|
  [^\\s>]*+   # anything but whitespace or closing brackets
)
~ix

See it in action on ideone. (The doubled backslashes are because the regex is written in the form of a PHP heredoc. I'd prefer to use a nowdoc, but ideone is apparently still running PHP 5.2.)
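
For reference, a minimal sketch of the same pattern written as a nowdoc (which needs PHP 5.3 or later), so the backslashes appear single. It is run against the four sample links from the question; `$html`, `$regex`, and `$urls` are illustrative names, and the final `trim()` step is an assumption borrowed from the comments on the accepted answer rather than part of this answer.

```php
<?php
// Sample links copied from the question.
$html = <<<'HTML'
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
HTML;

// Nowdoc: backslashes reach PCRE exactly as written, so no doubling is needed.
$regex = <<<'REGEX'
~
<a\s+[^>]*?\bhref=
(
  (["'])          # capture the opening quote
  (?:(?!\2).)*+   # anything except that same quote, zero or more times
  \2              # match the closing quote
|
  [^\s>]*+        # unquoted value: anything but whitespace or ">"
)
~ix
REGEX;

preg_match_all($regex, $html, $matches);

// Group 1 still includes the surrounding quotes, so trim them off.
$urls = array();
foreach ($matches[1] as $raw) {
    $urls[] = trim($raw, "\"'");
}

print_r($urls);
```

Because the closing quote must be the same character as the opening one (`\2`), a double-quoted URL may contain a single quote and vice versa.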
