-1

I need to catch the content of href using regex. For example, when I apply the rule to href="www.google.com", I'd like to get www.google.com. Also, I would like to ignore all hrefs which have only # in their value.

Now, I was playing around for some time, and I came up with this:

href=(?:\"|\')((?:[^#]|.#.|.#|#.)+)(?:\"|\')

When I try it out in http://www.rubular.com/ it works like a charm, but I need to use it with preg_replace_callback in PHP, and there I don't get the expected result (for testing it in PHP, I was using this site: http://www.pagecolumn.com/tool/pregtest.htm).

What's my mistake here?

mario
misaizdaleka
  • 1
    First and foremost: [DON'T USE REGEX TO PARSE HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Madara's Ghost Oct 28 '11 at 12:06

3 Answers

10

Since parsing HTML using regular expressions is a Bad Thing™, I suggest a less crude method:

$dom = new DOMDocument();
$dom->loadHTML($pageContent); // $pageContent holds the HTML source as a string

// DOMNodeList is traversable, so foreach can replace the index-based loop
foreach ($dom->getElementsByTagName('a') as $item) {
    $href = $item->getAttribute('href');
    // here's your href attribute
}
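
The DOM approach above can also cover the second requirement in the question, skipping links whose value is only `#`. What follows is a minimal sketch, not part of the original answer: the helper name `extractHrefs` and the sample markup are made up for illustration, and it assumes PHP's DOM extension is available.

```php
<?php
// Hypothetical helper: returns every <a> href except the bare "#" placeholder.
function extractHrefs(string $html): array
{
    $dom = new DOMDocument();

    // Collect libxml warnings internally; real-world markup is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '' || $href === '#') {
            continue; // no href at all, or only "#"
        }
        $hrefs[] = $href;
    }
    return $hrefs;
}

$sample = '<p><a href="www.google.com">G</a> <a href="#">top</a> '
        . '<a href=\'page.html#top\'>P</a></p>';
print_r(extractHrefs($sample)); // www.google.com and page.html#top
```

Because the parser handles the quoting, single-quoted, double-quoted, and unquoted attribute values are all treated the same way.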
Community
Linus Kleen
  • Great! Elegant and efficient solution! Great piece of advice too! Thanks! – misaizdaleka Oct 28 '11 at 12:26
  • @LinusKleen Why is it a bad thing? I have yet to understand why parsing HTML with regex is a bad thing. Oh and that guy didn't explain why, he just ranted. An explanation would help! – Mob Oct 28 '11 at 12:30
  • 1
    @Mob: The rant is indeed pointless (and has no educational effect). When people say "parsing HTML" they usually mean "extraction", and for simple cases like this one regular expressions are more than sufficient. Correctly parsing SGML and HTML (not so much XML and XHTML), however, requires far more complex PCRE patterns. http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491 - It's just that the advice became a meme on SO, and now the cursory (and incorrect) answer is to never use regex. (It should depend on use case and effort.) – mario Oct 28 '11 at 12:33
  • @LinusKleen and @mario: Thanks. – Mob Oct 28 '11 at 14:05
1

How about:

href\s*=\s*"([^#"]+#?[^"]*)"
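
To try this pattern in PHP it has to be wrapped in delimiters. Since the pattern itself contains `#`, the sketch below uses `~` instead; that choice and the sample markup are assumptions for illustration, not part of the answer.

```php
<?php
// The pattern from the answer, wrapped in "~" delimiters (delimiter choice is an assumption).
$pattern = '~href\s*=\s*"([^#"]+#?[^"]*)"~';

$sample = '<a href="www.google.com">G</a> <a href="#">top</a> '
        . '<a href="page.html#top">P</a> <a href="#top">T</a>';

// Group 1 holds the attribute value of every link that matched.
preg_match_all($pattern, $sample, $matches);
$hrefs = $matches[1];
print_r($hrefs); // www.google.com and page.html#top
```

Note that the pattern only handles double-quoted values, and that it skips every value starting with `#` (such as `#top`), not just a bare `#`.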
voidstate
1

First and foremost: DON'T USE REGEX TO PARSE HTML


I would go with something like:

href=("|')?([^\s"']+)("|')?
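
Since the question targets preg_replace_callback, here is a hedged sketch of how a pattern along these lines might be plugged into it. It is not the pattern above: it uses a backreference so the closing quote must match the opening one, and the callback leaves a bare `#` untouched. The helper name `rewriteHrefs` and the `/out?url=` prefix are made up for illustration.

```php
<?php
// Hypothetical helper: rewrites every quoted href except the bare "#" placeholder.
function rewriteHrefs(string $html): string
{
    // (["\']) captures the opening quote; \1 requires the same quote to close the value.
    $pattern = '~href\s*=\s*(["\'])(.*?)\1~i';

    $result = preg_replace_callback($pattern, function (array $m): string {
        if ($m[2] === '#') {
            return $m[0]; // leave placeholder links unchanged
        }
        // Illustrative rewrite only: route the link through a made-up redirect path.
        return 'href=' . $m[1] . '/out?url=' . rawurlencode($m[2]) . $m[1];
    }, $html);

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

$sample = '<a href="www.google.com">G</a> <a href=\'#\'>top</a>';
echo rewriteHrefs($sample), PHP_EOL;
```

Doing the `#` check inside the callback keeps the pattern itself simple, at the cost of matching a few links that are then returned unchanged.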
Madara's Ghost