17

I need a regex that extracts the string inside the quotes of an href attribute.

For example, I need to extract theurltoget.com from the following:

<a href="theurltoget.com">URL</a>

Additionally, I only want the base URL part, i.e. from http://www.mydomain.com/page.html I only want http://www.mydomain.com/

Andy Lester
David
  • 1
    General Consensus: Don't use Regular Expressions to parse HTML. – drudge Oct 22 '10 at 22:11
  • OK, how can I get the href tag then, using PHP? – David Oct 22 '10 at 22:12
  • 4
    http://php.net/manual/en/class.domdocument.php and http://php.net/manual/en/function.parse-url.php is all you need. – dev-null-dweller Oct 22 '10 at 22:15
  • Your data doesn't even contain a scheme. `href`'s may not always contain the scheme and domain. – webbiedave Oct 22 '10 at 22:19
  • duplicate of [Regular expression for grabbing the href attribute of an A element](http://stackoverflow.com/questions/3820666/regular-expression-for-grabbing-the-href-attribute-of-an-a-element) – Gordon Oct 23 '10 at 11:41
  • 3
    **Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester Aug 02 '13 at 14:58
  • 1
    @AndyLester Hi Andy, the site is htmlparsing.com (with /php is a 404). – zeuf Oct 08 '22 at 11:18
  • @zeuf I need to fix that. http://htmlparsing.com/php.html and http://htmlparsing.com/php should be the same. – Andy Lester Oct 08 '22 at 17:55

9 Answers

23

Don't use regex for this. You can use XPath and built-in PHP functions to get what you want:

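    // Note: simplexml_load_string() requires well-formed XML/XHTML input.
    $xml = simplexml_load_string($myHtml);
    $list = $xml->xpath("//@href");

    $preparedUrls = array();
    foreach ($list as $item) {
        // Cast the SimpleXMLElement attribute to a string before parsing.
        $parts = parse_url((string) $item);
        // Skip relative or malformed URLs that lack a scheme or host.
        if (!isset($parts['scheme'], $parts['host'])) {
            continue;
        }
        $preparedUrls[] = $parts['scheme'] . '://' . $parts['host'] . '/';
    }
    print_r($preparedUrls);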
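
Real-world HTML is often not well-formed XML, in which case simplexml_load_string() may fail. A minimal sketch of the same idea using DOMDocument (suggested in the comments on the question) could look like the following. The function name extractBaseUrls and the sample markup are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: collects "scheme://host/" for every absolute link in $html.
function extractBaseUrls(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $baseUrls = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $parts = parse_url($anchor->getAttribute('href'));
        // Relative links have no scheme or host, so they are skipped.
        if (is_array($parts) && isset($parts['scheme'], $parts['host'])) {
            $baseUrls[] = $parts['scheme'] . '://' . $parts['host'] . '/';
        }
    }
    return $baseUrls;
}

// Sample input (assumed): one absolute link with an extra attribute, one relative link.
$html = '<p><a href="http://www.mydomain.com/page.html" class="blue">URL</a> <a href="/relative">local</a></p>';
print_r(extractBaseUrls($html));
```

Because the parser handles attribute order and quoting, extra attributes such as class do not affect the result.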
Drew Hunter
16
$html = '<a href="http://www.mydomain.com/page.html">URL</a>';

// preg_match() returns the number of matches (0 or 1); the captured URL ends up in $match[1].
$url = preg_match('/<a href="(.+)">/', $html, $match);

$info = parse_url($match[1]);

echo $info['scheme'].'://'.$info['host']; // http://www.mydomain.com
Alec
  • 2
    This will not work when there are more attributes in the tag. E.g. href="http://www.mydomain.com/page.html" class="blue" rel=even". This returns [path] => /page.html" class="blue" rel=even" – Linkmichiel Aug 09 '13 at 12:31
  • Additionally: yes, this will work if you're only looking for the base url part (the 2nd part of the question by @David)! If you are looking for the entire url between the href, use another regex (I will try posting this in an answer below). – Linkmichiel Aug 14 '13 at 07:27
8

This expression handles three cases:

  1. no quotes
  2. double quotes
  3. single quotes

'/href=["\']?([^"\'>]+)["\']?/'
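
As a quick check of the three cases, the pattern above can be run against one sample per quoting style. The sample strings are assumptions made for illustration; note that an unquoted value followed by further attributes would be captured up to the closing `>`, so this sketch only covers the simple forms.

```php
<?php
// The pattern from the answer, unchanged.
$pattern = '/href=["\']?([^"\'>]+)["\']?/';

// One sample per case: no quotes, double quotes, single quotes.
$samples = [
    '<a href=theurltoget.com>URL</a>',
    '<a href="theurltoget.com">URL</a>',
    "<a href='theurltoget.com'>URL</a>",
];

$results = [];
foreach ($samples as $sample) {
    // $match[1] holds the text captured by the parenthesised group.
    if (preg_match($pattern, $sample, $match) === 1) {
        $results[] = $match[1];
    }
}
print_r($results);
```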

ishubin
  • I have a case where this expression works, except when double quotes urls contain single quotes (like how the google maps urls do). Right now, your regex stops at the first single quote (even if the url is surrounded by double quotes) – CatalinBerta Jun 07 '16 at 13:48
7

Use the answer by @Alec if you're only looking for the base url part (the 2nd part of the question by @David)!

$html = '<a href="http://www.mydomain.com/page.html" class="myclass" rel="myrel">URL</a>';
$url = preg_match('/<a href="(.+)">/', $html, $match);
$info = parse_url($match[1]);

This will give you:

$info
Array
(
    [scheme] => http
    [host] => www.mydomain.com
    [path] => /page.html" class="myclass" rel="myrel
)

So you can use $href = $info["scheme"] . "://" . $info["host"]; which gives you:

// http://www.mydomain.com  

If you are looking for the entire URL inside the href, you should use another regex, for instance the one provided by @user2520237.

$html = '<a href="http://www.mydomain.com/page.html" class="myclass" rel="myrel">URL</a>';
$url = preg_match('/href=["\']?([^"\'>]+)["\']?/', $html, $match);
$info = parse_url($match[1]);

This will give you:

$info
Array
(
    [scheme] => http
    [host] => www.mydomain.com
    [path] => /page.html
)

Now you can use $href = $info["scheme"] . "://" . $info["host"] . $info["path"]; which gives you:

// http://www.mydomain.com/page.html
Linkmichiel
4

To replace all href values:

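function replaceHref($html, $replaceStr)
{
    $match = array();
    // Capture the value of every double-quoted href inside an <a> tag.
    preg_match_all('/<a [^>]*href="([^"]+)"/', $html, $match);

    // $match[1] holds the captured URLs; array_unique() avoids rewriting the same URL twice.
    foreach (array_unique($match[1]) as $url)
    {
        $html = str_replace('href="' . $url . '"', 'href="' . $replaceStr . urlencode($url) . '"', $html);
    }
    return $html;
}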
$replaceStr  = "http://affilate.domain.com?cam=1&url=";
$replaceHtml = replaceHref($html, $replaceStr);

echo $replaceHtml;
Mihai Iorga
Basani
4

http://www.the-art-of-web.com/php/parse-links/

Let's start with the simplest case - a well formatted link with no extra attributes:

/<a href=\"([^\"]*)\">(.*)<\/a>/iU
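
A minimal sketch of how this pattern might be used with preg_match_all() to collect both the URL and the link text. The sample HTML and variable names are assumptions made for illustration. The `i` modifier makes the match case-insensitive and `U` makes the quantifiers ungreedy, so `(.*)` stops at the first closing tag.

```php
<?php
// Sample input (assumed): two simple links, one written in upper case.
$html = '<a href="http://a.example/one">One</a> and <A HREF="http://b.example/two">Two</A>';

// PREG_SET_ORDER groups the results per link: [full match, URL, link text].
preg_match_all('/<a href=\"([^\"]*)\">(.*)<\/a>/iU', $html, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    $links[$m[1]] = $m[2]; // URL => link text
}
print_r($links);
```

As the quoted text says, this only covers links with no extra attributes.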
drudge
0

This will handle the case where there are no quotes around the URL.

/<a [^>]*href="?([^">]+)"?>/

But seriously, do not parse HTML with regex. Use DOM or a proper parsing library.

Community
kijin
-1
/href="(https?://[^/]*)/

I think you should be able to handle the rest.
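
One way "the rest" might look in PHP is sketched below. Two adjustments are assumptions on my part rather than part of the answer: the pattern uses `#` as the delimiter, because the unescaped `//` would otherwise clash with a `/` delimiter, and `"` is added to the excluded characters so a URL without a path stops at the closing quote.

```php
<?php
$html = '<a href="http://www.mydomain.com/page.html">URL</a>';

$baseUrl = null;
// Capture the scheme and host only: everything up to the first "/" after "://".
if (preg_match('#href="(https?://[^/"]*)#', $html, $match) === 1) {
    $baseUrl = $match[1] . '/'; // append the trailing slash asked for in the question
}
echo $baseUrl, "\n";
```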

Adam Byrtek
-2

Because lookbehind and lookahead are cool (the pattern below uses a positive lookbehind and a positive lookahead):

/(?<=href=\").+(?=\")/

It will match only what you want, without quotation marks

Array ( [0] => theurltoget.com )
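
Note that `.+` is greedy, so the pattern above runs to the last quote on the line once the tag has more attributes. A hedged sketch comparing it with a variant that uses `[^"]+` instead follows. The variant and the sample markup are assumptions made for illustration, not part of the original answer.

```php
<?php
// Sample input (assumed): the link from the question with one extra attribute.
$html = '<a href="theurltoget.com" class="blue">URL</a>';

// Original pattern: the greedy .+ extends to the last quote in the string.
preg_match('/(?<=href=\").+(?=\")/', $html, $greedy);

// Variant: [^"]+ cannot cross a quote, so it stops at the end of the href value.
preg_match('/(?<=href=")[^"]+(?=")/', $html, $bounded);

print_r($greedy);
print_r($bounded);
```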

Pablo S G Pacheco