Getting content of href value

Question

I need to catch the content of href using regex. For example, when I apply the rule to href="www.google.com", I'd like to get www.google.com. Also, I would like to ignore all hrefs which have only # in their value.

Now, I was playing around for some time, and I came up with this:

href=(?:\"|\')((?:[^#]|.#.|.#|#.)+)(?:\"|\')

When I try it out in http://www.rubular.com/ it works like a charm, but I need to use it with preg_replace_callback in PHP, and there I don't get the expected result (for testing it in PHP, I was using this site: http://www.pagecolumn.com/tool/pregtest.htm).

What's my mistake here?

First and foremost: [DON'T USE REGEX TO PARSE HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — Madara's Ghost, Oct 28 '11 at 12:06

score 10 · Accepted Answer · edited May 23 '17 at 12:19

Since parsing HTML using regular expressions is a Bad Thing™, I suggest a less crude method:

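$dom = new DOMDocument;
// Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($pageContent);
libxml_clear_errors();

foreach ($dom->getElementsByTagName('a') as $item) {
    $href = $item->getAttribute('href');
    if ($href === '' || $href === '#') {
        continue; // skip anchors without an href and bare "#" values
    }
    // here's your href attribute
}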
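
For reference, the same approach can be wrapped in a function that returns the collected values. This is only a minimal sketch: the function name extractHrefs and the sample markup are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper (the name is an assumption): returns the href values of
// all <a> elements, skipping missing hrefs and values that are just "#".
function extractHrefs(string $html): array
{
    $dom = new DOMDocument();

    // Silence warnings about malformed markup, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '' || $href === '#') {
            continue;
        }
        $hrefs[] = $href;
    }

    return $hrefs;
}

// Sample input invented for this sketch.
$sample = '<p><a href="www.google.com">Google</a> <a href="#">Top</a> '
        . '<a href="page.html#top">Page</a></p>';

print_r(extractHrefs($sample));
```

With this sample, the bare "#" link is dropped, while page.html#top is kept because its value is more than just "#".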

answered Oct 28 '11 at 12:08

Linus Kleen

Great! Elegant and efficient solution! Great piece of advice too! Thanks! – misaizdaleka Oct 28 '11 at 12:26
@LinusKleen Why is it a bad thing? I have yet to understand why parsing HTML with regex is a bad thing. Oh and that guy didn't explain why, he just ranted. An explanation would help! – Mob Oct 28 '11 at 12:30
@Mob: The rant is indeed pointless (and without educational effect). When people say "parsing HTML" they actually mean "extraction", and in simple cases like this one regular expressions are more than sufficient for that. Correctly parsing SGML and HTML (not so much XML and XHTML), however, requires far more complex PCRE patterns: http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491 - It's just that the advice became a meme on SO, and now the cursory (and incorrect) answer is to never use regex. (It should depend on use case and effort.) – mario Oct 28 '11 at 12:33
LinusKleen and @mario Thanks. – Mob Oct 28 '11 at 14:05

score 1 · Answer 2 · answered Oct 28 '11 at 12:09

How about:

href\s*=\s*"([^#"]+#?[^"]*)"
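
Since the question mentions preg_replace_callback, here is a rough sketch of how a pattern like this might be plugged in. PHP's preg_* functions require delimiters around the pattern (the "~" used here is an arbitrary choice), and the "http://" prefix added in the callback is purely an illustrative transformation, not something the question asks for. Note that this pattern only handles double-quoted values and, as written, skips any value that starts with "#".

```php
<?php
// Pattern from the answer, wrapped in "~" delimiters for PCRE.
$pattern = '~href\s*=\s*"([^#"]+#?[^"]*)"~';

// Sample input invented for this sketch.
$html = '<a href="www.google.com">Google</a> <a href="#">Top</a>';

// The callback receives the match array: $m[0] is the whole match,
// $m[1] is the captured href value.
$result = preg_replace_callback($pattern, function (array $m): string {
    return 'href="http://' . $m[1] . '"';
}, $html);

echo $result, PHP_EOL;
// <a href="http://www.google.com">Google</a> <a href="#">Top</a>
```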

voidstate

score 1 · Answer 3 · answered Oct 28 '11 at 12:10

First and foremost: DON'T USE REGEX TO PARSE HTML

I would go with something like:

href=("|')?([^\s"']+)("|')?
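
One detail worth checking in a pattern of this shape is where the + sits relative to the capturing parentheses. If the quantifier is outside the group, PCRE overwrites the group on each repetition and it ends up holding only the last character. The sketch below compares both placements. The sample markup is an assumption, and because the pattern itself does not exclude "#", that filtering is done afterwards.

```php
<?php
// Sample input invented for this sketch.
$html = '<a href="www.google.com">Google</a> <a href=\'#\'>Top</a>';

// "+" outside group 2: the group keeps only the last character it matched.
preg_match_all('~href=("|\')?([^\s"\'])+("|\')?~', $html, $outside);

// "+" inside group 2: the group holds the full attribute value.
preg_match_all('~href=("|\')?([^\s"\']+)("|\')?~', $html, $inside);

// Neither variant excludes "#", so drop those values afterwards.
$hrefs = array_values(array_filter($inside[2], function (string $h): bool {
    return $h !== '#';
}));

print_r($outside[2]); // last characters only: "m" and "#"
print_r($hrefs);      // only "www.google.com" remains
```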

Madara's Ghost
