php regex to get string inside href tag

Question

I need a regex that will give me the string inside an href attribute, i.e. the text between the quotes.

For example, I need to extract theurltoget.com from the following:

<a href="theurltoget.com">URL</a>

Additionally, I only want the base URL part, i.e. from http://www.mydomain.com/page.html I only want http://www.mydomain.com/

General Consensus: Don't use Regular Expressions to parse HTML. — drudge, Oct 22 '10 at 22:11
http://php.net/manual/en/class.domdocument.php and http://php.net/manual/en/function.parse-url.php are all you need. — dev-null-dweller, Oct 22 '10 at 22:15
Your data doesn't even contain a scheme. `href`'s may not always contain the scheme and domain. — webbiedave, Oct 22 '10 at 22:19
duplicate of [Regular expression for grabbing the href attribute of an A element](http://stackoverflow.com/questions/3820666/regular-expression-for-grabbing-the-href-attribute-of-an-a-element) — Gordon, Oct 23 '10 at 11:41
**Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. — Andy Lester, Aug 02 '13 at 14:58
@AndyLester Hi Andy, the site is htmlparsing.com (with /php is a 404). — zeuf, Oct 08 '22 at 11:18
@zeuf I need to fix that. http://htmlparsing.com/php.html and http://htmlparsing.com/php should be the same. — Andy Lester, Oct 08 '22 at 17:55

score 23 · Answer 1 · answered Oct 22 '10 at 23:04


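Don't use regex for this. You can use XPath and built-in PHP functions to get what you want:

    // simplexml_load_string() only accepts well-formed XML, so $myHtml must be valid XHTML.
    $xml = simplexml_load_string($myHtml);
    $list = $xml->xpath("//@href");

    $preparedUrls = array();
    foreach($list as $item) {
        // Cast the attribute node to a string before handing it to parse_url().
        $item = parse_url((string) $item);
        // Skip relative links, which have no scheme or host.
        if (isset($item['scheme'], $item['host'])) {
            $preparedUrls[] = $item['scheme'] . '://' .  $item['host'] . '/';
        }
    }
    print_r($preparedUrls);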
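
For markup that is not well-formed XML, the DOMDocument class linked in the question comments may be the more forgiving route. The following is only a sketch under assumptions: the sample `$myHtml` string and the `$baseUrls` variable name are made up for illustration and are not from the answer.

```php
<?php
// Hypothetical sample input; the second link is relative on purpose.
$myHtml = '<p><a href="http://www.mydomain.com/page.html" class="blue">URL</a>'
        . '<a href="/about">About</a></p>';

// Collect libxml warnings about imperfect markup instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($myHtml);
libxml_clear_errors();

$baseUrls = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $parts = parse_url($a->getAttribute('href'));
    // Relative links such as /about have no scheme or host, so skip them.
    if (isset($parts['scheme'], $parts['host'])) {
        $baseUrls[] = $parts['scheme'] . '://' . $parts['host'] . '/';
    }
}
print_r($baseUrls);
```

Only the absolute link contributes an entry here; how relative links should be resolved depends on the page's own URL, which this sketch does not know.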

answered Oct 22 '10 at 23:04 by Drew Hunter

Can you provide a relevant sample $myHtml string, please? – AgelessEssence Jan 28 '13 at 09:31

How can I check whether the HTML is invalid, for example when a user enters `div>` instead of a complete tag? – Gowri Jul 10 '13 at 07:40

This is the most elegant way to extract attributes from an HTML document. – Juan Lago Sep 12 '14 at 09:29

score 16 · Answer 2 · answered Oct 22 '10 at 22:17


    $html = '<a href="http://www.mydomain.com/page.html">URL</a>';

    $url = preg_match('/<a href="(.+)">/', $html, $match);

    $info = parse_url($match[1]);

    echo $info['scheme'].'://'.$info['host']; // http://www.mydomain.com

answered Oct 22 '10 at 22:17 by Alec


This will not work when there are more attributes in the tag. E.g. href="http://www.mydomain.com/page.html" class="blue" rel=even". This returns [path] => /page.html" class="blue" rel=even" – Linkmichiel Aug 09 '13 at 12:31
Additionally: yes, this will work if you're only looking for the base url part (the 2nd part of the question by @David)! If you are looking for the entire url between the href, use another regex (I will try posting this in an answer below). – Linkmichiel Aug 14 '13 at 07:27

score 8 · Answer 3 · answered Aug 02 '13 at 14:55


This expression handles three cases: no quotes, double quotes, and single quotes.

    '/href=["\']?([^"\'>]+)["\']?/'
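
A short check of the expression against the three quoting styles, followed by a variant for values that contain the other kind of quote. The input strings, the `$strict` pattern and the variable names are assumptions added for illustration; the variant relies on PCRE's branch-reset group `(?|...)` so that the value always lands in group 1.

```php
<?php
// The expression from this answer, unchanged.
$pattern = '/href=["\']?([^"\'>]+)["\']?/';

// Hypothetical input covering the three quoting styles.
$html = '<a href=one.html>1</a>'
      . '<a href="two.html">2</a>'
      . "<a href='three.html'>3</a>";

preg_match_all($pattern, $html, $matches);
print_r($matches[1]); // the captured href values, in document order

// Variant (an assumption, not part of the answer): one branch per quoting
// style, so a single quote inside a double-quoted value does not end the match.
$strict = '/href=(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/';
$tricky = '<a href="map.html?q=o\'hare">map</a>';

preg_match($strict, $tricky, $m);
echo $m[1], "\n"; // map.html?q=o'hare
```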

answered Aug 02 '13 at 14:55 by ishubin

I have a case where this expression works, except when double-quoted URLs contain single quotes (as Google Maps URLs do). Right now, your regex stops at the first single quote, even if the URL is surrounded by double quotes. – CatalinBerta Jun 07 '16 at 13:48

score 7 · Answer 4 · answered Aug 14 '13 at 07:54

Use the answer by @Alec if you only need the base URL (the second part of the question by @David):

    $html = '<a href="http://www.mydomain.com/page.html" class="myclass" rel="myrel">URL</a>';
    $url = preg_match('/<a href="(.+)">/', $html, $match);
    $info = parse_url($match[1]);

This will give you:

    $info
    Array
    (
        [scheme] => http
        [host] => www.mydomain.com
        [path] => /page.html" class="myclass" rel="myrel
    )

The path is polluted by the other attributes, but the scheme and host are intact, so you can use `$href = $info["scheme"] . "://" . $info["host"];`, which gives you:

    // http://www.mydomain.com

If you need the entire URL inside the href, use a different regex, for instance the one provided by @user2520237:

    $html = '<a href="http://www.mydomain.com/page.html" class="myclass" rel="myrel">URL</a>';
    $url = preg_match('/href=["\']?([^"\'>]+)["\']?/', $html, $match);
    $info = parse_url($match[1]);

This will give you:

    $info
    Array
    (
        [scheme] => http
        [host] => www.mydomain.com
        [path] => /page.html
    )

Now you can use `$href = $info["scheme"] . "://" . $info["host"] . $info["path"];`, which gives you:

    // http://www.mydomain.com/page.html

score 4 · Answer 5 · edited Sep 27 '12 at 14:14

To replace every href value:

    function replaceHref($html, $replaceStr)
    {
        $match = array();
        // [^"]+ stops at the closing quote; (.+) would run on to the last quote on the line.
        preg_match_all('/<a [^>]*href="([^"]+)"/', $html, $match);

        // $match[1] holds the captured URLs; array_unique() avoids prefixing a repeated URL twice.
        foreach (array_unique($match[1]) as $url)
        {
            $html = str_replace('href="'.$url.'"', 'href="'.$replaceStr.urlencode($url).'"', $html);
        }
        return $html;
    }
    $replaceStr  = "http://affilate.domain.com?cam=1&url=";
    $replaceHtml = replaceHref($html, $replaceStr);

    echo $replaceHtml;

score 4 · Answer 6 · answered Oct 22 '10 at 22:15


http://www.the-art-of-web.com/php/parse-links/

Let's start with the simplest case - a well formatted link with no extra attributes:

    /<a href=\"([^\"]*)\">(.*)<\/a>/iU
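
To see what the `U` (ungreedy) modifier buys here, the pattern can be run over a line with two links. The two-link `$html` string is a made-up example, not taken from the linked article.

```php
<?php
// The pattern quoted above; U makes (.*) stop at the first </a> instead of the last.
$pattern = '/<a href=\"([^\"]*)\">(.*)<\/a>/iU';

// Hypothetical input with two simple links on one line.
$html = '<a href="a.html">First</a> and <a href="b.html">Second</a>';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    // $m[1] is the href value, $m[2] is the link text.
    echo $m[1], ' => ', $m[2], "\n";
}
```

Without `U`, the `(.*)` group would swallow everything up to the final `</a>` and report a single match.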

answered Oct 22 '10 at 22:15 by drudge

`/U` modifier fixed my problem. Thanks for the tip! – Arda Dec 18 '15 at 16:50
This won't work if there are other attributes in the `<a>` element. – MrUpsidown Feb 08 '17 at 12:44

score 0 · Answer 7 · edited May 23 '17 at 12:02


This will handle the case where there are no quotes around the URL.

    /<a [^>]*href="?([^">]+)"?>/

But seriously, do not parse HTML with regex. Use DOM or a proper parsing library.

answered Oct 22 '10 at 22:14 by kijin, edited May 23 '17 at 12:02 by Community


I am a little afraid of DOM; I believe regex can save your life in many cases ... – Master345 Jan 18 '12 at 20:42
No quotes around the URL ... is that even syntactically correct? – AlxVallejo Aug 31 '15 at 13:10

score -1 · Answer 8 · answered Oct 22 '10 at 22:12


    /href="(https?:\/\/[^\/]*)/

I think you should be able to handle the rest.
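
One way "the rest" might look in PHP, as a sketch only. It uses `#` as the pattern delimiter so the slashes need no escaping, and it adds `"` to the excluded characters so a URL without a path does not run past the closing quote; both choices, the sample `$html` and the `$baseUrl` name are assumptions rather than part of the answer.

```php
<?php
// Sample markup from the question.
$html = '<a href="http://www.mydomain.com/page.html">URL</a>';

// Capture scheme and host: everything up to the next slash or closing quote.
if (preg_match('#href="(https?://[^/"]*)#', $html, $match)) {
    $baseUrl = $match[1] . '/';
    echo $baseUrl, "\n"; // http://www.mydomain.com/
}
```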

answered Oct 22 '10 at 22:12 by Adam Byrtek

score -2 · Answer 9 · answered May 12 '14 at 03:59


Because positive lookbehind and lookahead are cool:

    /(?<=href=\").+(?=\")/

For the example in the question it matches only what you want, without the quotation marks:

    Array ( [0] => theurltoget.com )
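
Note that `.+` is greedy, so the result changes once the tag carries more quoted attributes. The comparison below is a sketch: the `class` attribute in the sample and the `[^"]+` replacement are assumptions added for illustration.

```php
<?php
// Hypothetical markup with a second quoted attribute after href.
$html = '<a href="theurltoget.com" class="blue">URL</a>';

// Greedy .+ (as in the pattern above) runs on to the last quote on the line.
preg_match('/(?<=href=").+(?=")/', $html, $greedy);

// [^"]+ (an assumed tweak) cannot cross the closing quote of the href value.
preg_match('/(?<=href=")[^"]+(?=")/', $html, $tight);

echo $greedy[0], "\n"; // theurltoget.com" class="blue
echo $tight[0], "\n";  // theurltoget.com
```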

answered May 12 '14 at 03:59 by Pablo S G Pacheco
