I want to match all href values in my page content. I wrote a regex for that and tested it on regex101:

href[ ]*=[ ]*("|')(.+?)\1

This finds all my href values properly. If I use

href[ ]*=[ ]*(?:"|')(.+?)(?:"|')

it's even better, since I don't have to pick out a specific group later.

With both " and ' in the regex string, I cannot get the regex to run in PHP:

$matches = array();  
$pattern = "/href[ ]*=[ ]*("|')(.+?)\1/"; // syntax error 
$numOfMatches = preg_match_all($pattern, $pattern, $matches);  
print_r($matches);

If I escape the double quote to fix the syntax error, I get no matches.

So, what is the correct way to apply this regex in PHP?

Thanks for any help

Notes:

  • addslashes or preg_quote won't help, since I need to pass a valid string first
  • escaping all the special chars \ + * ? [ ^ ] $ ( ) { } = ! < > | : - didn't help either

EDIT: OK, I see I really shouldn't be doing this with regex. Could you please suggest some DOM parsers or other tools I should use with PHP instead?

trainoasis
  • For finding all href links regex is not the right tool. – anubhava Jul 16 '14 at 13:38
  • To escape chars just use the backslash like: `\"` (quotes) or `\(` (brace) or \\ (backslash) – Armin Jul 16 '14 at 13:40
  • Haha, this escaping thing doesn't let me highlight \\ as code ^^ – Armin Jul 16 '14 at 13:41
  • Don't use regexes. This'd be far easier with DOM+XPath: `//*[@href]` and done. And just because you ARE defining a regex doesn't excuse you from PHP's string syntax rules. If your string contains the same quote that delimits the string, that internal quote must be escaped. – Marc B Jul 16 '14 at 21:37
  • Are you trying to [parse HTML with regular expressions](http://stackoverflow.com/a/1732454/53114)? Why not use a proper parser? – Gumbo Jul 16 '14 at 21:38
  • [Don't use regex to parse HTML!](http://stackoverflow.com/a/1732454/2812842) Read a **basic** XPath tutorial and you'll see that returning all href attributes in a document is easy. Regex will kill you one day. – scrowler Jul 16 '14 at 21:39
  • Thanks guys, I see I really shouldn't use regex here. Please see my edited question – trainoasis Jul 17 '14 at 06:03
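
The DOM+XPath approach suggested in the comments could look roughly like the sketch below. It uses PHP's bundled `DOMDocument` and `DOMXPath` classes with the `//*[@href]` query from Marc B's comment. The `$html` sample and the `$hrefs` variable are made up for illustration; real page content would take their place.

```php
<?php
// Hypothetical sample markup; in practice this would be your page content.
$html = <<<'HTML'
<p><a href="http://example.com/a">A</a> <a href='/b'>B</a>
<a href = "c.html">C</a></p>
HTML;

$doc = new DOMDocument();
// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$hrefs = [];
// //*[@href] selects every element that carries an href attribute.
foreach ($xpath->query('//*[@href]') as $node) {
    $hrefs[] = $node->getAttribute('href');
}
print_r($hrefs);
```

Because the parser deals with quoting and whitespace itself, single quotes, double quotes and spaces around `=` need no special handling here.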

2 Answers

For your case, the following should work:

/<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU

Given the nature of the WWW there are always going to be cases where the regular expression breaks down. Small changes to the patterns can fix these.

spaces around the = after href:

/<a\s[^>]*href\s*=\s*(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU

matching only links starting with http:

/<a\s[^>]*href=(\"??)(http[^\" >]*?)\\1[^>]*>(.*)<\/a>/siU

single quotes around the link address:

/<a\s[^>]*href=([\"\']??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU

Source
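
The patterns above appear to be written as the contents of a PHP double-quoted string: `\"` stands for a literal double quote and `\\1` for the backreference `\1`. As a rough sketch under that assumption, the "spaces around the =" variant can be applied like this. The `$html` sample is invented for illustration; the URL ends up in capture group 2.

```php
<?php
// Hypothetical sample input.
$html = '<a href="http://example.com/a">A</a> <a class="x" href = "/b">B</a>';

// Double-quoted PHP string: \" yields a literal quote, \\1 yields the backreference \1.
// The U modifier inverts greediness, so "??" and "*?" are greedy here and ".*" is lazy.
$pattern = "/<a\s[^>]*href\s*=\s*(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU";

preg_match_all($pattern, $html, $matches);
// $matches[2] holds the href values, $matches[3] the link texts.
print_r($matches[2]);
```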

Rishi Dua

I had to use this regex to make it work. Next time I will definitely try a DOM parser :)

$regexForHREF = "/href[ ]*=[ ]*(?:\"|')(.+?)(?:\"|')/";
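
One likely reason the backreference version found nothing once the quote was escaped: inside a double-quoted PHP string, `"\1"` is an octal escape (the byte `chr(1)`), so the regex engine never sees a backreference. Note also that the snippet in the question passes `$pattern` as the subject as well. A minimal sketch of two ways to keep the backreference, using an invented `$html` sample:

```php
<?php
// Hypothetical sample input with both quote styles.
$html = '<a href="one.html">1</a> <a href = \'two.html\'>2</a>';

// Single-quoted PHP string: only \' and \\ are escapes, so \1 reaches PCRE intact.
$pattern = '/href[ ]*=[ ]*("|\')(.+?)\1/';
preg_match_all($pattern, $html, $matches);
// Group 1 is the quote character, group 2 the href value.
print_r($matches[2]);

// Double-quoted equivalent: "\1" alone would be chr(1), so the backslash is doubled.
$same = "/href[ ]*=[ ]*(\"|')(.+?)\\1/";
var_dump($pattern === $same);
```

The single-quoted form is usually easier to read for regexes, since far fewer characters need escaping at the PHP level.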
trainoasis