Grab URL within a string which contains HTML code

Question

I have a string, for example:

$html = '<p>hello<a href="https://www.youtube.com/watch?v=7HknMcG2qYo">world</a></p><p>hello<a href="https://youtube.com/watch?v=37373o">world</a></p>';

And I want to search the string for the first URL that starts with youtube.com or youtu.be and store it in variable $first_found_youtube_url.

How can I do this efficiently?

I could use preg_match or strpos to look for the URLs, but I'm not sure which approach is more appropriate.

The appropriate thing to do would probably be to use a parser and parse it as what it is: HTML. – adeneo, Dec 23 '15 at 23:38
@adeneo Can you please elaborate? I don't understand your approach. English is not my first language and I don't pick up on technical terms that easily. – Henrik Petterson, Dec 23 '15 at 23:39

score 4 · Accepted Answer · edited Dec 24 '15 at 00:04

I wrote this function a while back. It uses a regex and returns an array of unique URLs. Since you want the first one, you can just use the first item in the array.

function getUrlsFromString($string) {
    // Match http(s) URLs, stopping at whitespace, parentheses and angle brackets.
    $regex = '#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#i';
    preg_match_all($regex, $string, $matches);
    // Deduplicate but keep document order, so index 0 is the first URL found.
    return array_values(array_unique($matches[0]));
}

Example:

$html = '<p>hello<a href="https://www.youtube.com/watch?v=7HknMcG2qYo">world</a></p><p>hello<a href="https://youtube.com/watch?v=37373o">world</a></p>';
$urls = getUrlsFromString($html);
$first_found_youtube = $urls[0];

With a YouTube-specific regex:

function getYoutubeUrlsFromString($string) {
    // Dots are escaped so they match literally; the ID class also allows "_" and "-".
    $regex = '#(https?:\/\/(?:www\.)?(?:youtube\.com\/watch\?v=|youtu\.be\/)([\w-]+))#i';
    preg_match_all($regex, $string, $matches);
    // Deduplicate but keep document order, so index 0 is the first URL found.
    return array_values(array_unique($matches[0]));
}

Example:

$html = '<p>hello<a href="https://www.youtube.com/watch?v=7HknMcG2qYo">world</a></p><p>hello<a href="https://youtube.com/watch?v=37373o">world</a></p>';
$urls = getYoutubeUrlsFromString($html);
$first_found_youtube = $urls[0];
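
Since the question only asks for the first URL, `preg_match` may be enough, because it stops scanning at the first match. The following is a minimal sketch, not part of the original answer: the helper name `getFirstYoutubeUrl` is hypothetical, and the pattern assumes video IDs consist only of letters, digits, `_` and `-`.

```php
<?php
// Hypothetical helper (not from the answer above): returns the first
// YouTube URL found in $string, or null when there is none.
function getFirstYoutubeUrl(string $string): ?string
{
    // Dots are escaped so that "youtube.com" matches literally.
    // [\w-]+ is an assumption about which characters a video ID may contain.
    $regex = '#https?://(?:www\.)?(?:youtube\.com/watch\?v=|youtu\.be/)[\w-]+#i';

    // preg_match scans left to right and stops at the first match.
    if (preg_match($regex, $string, $match) === 1) {
        return $match[0];
    }
    return null;
}

$html = '<p>hello<a href="https://www.youtube.com/watch?v=7HknMcG2qYo">world</a></p>'
      . '<p>hello<a href="https://youtube.com/watch?v=37373o">world</a></p>';
$first_found_youtube_url = getFirstYoutubeUrl($html);
```

Returning null instead of indexing into an empty array avoids an "undefined offset" notice when the string contains no YouTube link.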

edited Dec 24 '15 at 00:04 by h2ooooooo · answered Dec 23 '15 at 23:39 by skrilled

Interesting, I'm not sure if this is the best approach but I'll bite. Can you please alter the regex to only include youtube.com and youtu.be links? – Henrik Petterson Dec 23 '15 at 23:41
@HenrikPetterson try this `https?:\/\/(?:www\.)?(?:youtube.com\/watch\?v=|youtu.be\/)(.*)` - the only capture group contains the video ID – Tyler Sebastian Dec 23 '15 at 23:42
@TylerSebastian Thanks for the post but seems quite incomplete. E.g. http should be there and https... – Henrik Petterson Dec 23 '15 at 23:45
@HenrikPetterson it is there... do you mean captured? just put brackets around the whole thing: `(https?:\/\/(?:www\.)?(?:youtube.com\/watch\?v=|youtu.be\/)(.*))` - the first capture group would be the whole thing, the second would be the video id. – Tyler Sebastian Dec 23 '15 at 23:47
@TylerSebastian Please try the following to understand the problem: http://pastebin.com/6zS2fAus – Henrik Petterson Dec 23 '15 at 23:51
Use `(https?:\/\/(?:www\.)?(?:youtube.com\/watch\?v=|youtu.be\/)([a-zA-Z0-9]*))`. http://regexr.com/3cfhn – Tyler Sebastian Dec 23 '15 at 23:53
Thanks @TylerSebastian - updated answer to include your regex with a rewritten function. – skrilled Dec 23 '15 at 23:55
Make sure you make your regex case-insensitive (the `i` modifier after the closing `#`) in case you get capitalised links like `HTTP://WWW.WEBSITE.NET`. – h2ooooooo Dec 24 '15 at 00:00
Parsing HTML with regex is generally a bad idea. -- http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – hanshenrik Dec 24 '15 at 00:52
@skrilled I have a request. I will put a 100-point bounty on this question when it is eligible, given that I'm asking for something beyond the initial question. Can you please make the regex identify all types of YouTube links? There are more than I thought; here is the list with a regex: http://stackoverflow.com/a/5831191/1185126 And then, can you make this function convert the result into a simple youtube.com/watch?v=... type of URL? The answer I linked already has all of this, I just can't figure out how to merge these together into one function. – Henrik Petterson Dec 24 '15 at 16:18
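
The conversion requested in the last comment could be sketched roughly as follows. This is not the regex from the linked answer and it does not cover every YouTube link type: it only handles the two shapes discussed in this thread (`youtube.com/watch?v=ID` and `youtu.be/ID`). The helper name `normalizeYoutubeUrl` is hypothetical, and the allowed ID characters are an assumption.

```php
<?php
// Hypothetical helper: rewrites a youtu.be/ID or youtube.com/watch?v=ID URL
// to https://www.youtube.com/watch?v=ID. Returns null for anything else.
function normalizeYoutubeUrl(string $url): ?string
{
    $parts = parse_url($url);
    if (!is_array($parts) || !isset($parts['host'])) {
        return null;
    }

    $host = strtolower($parts['host']);
    $id = null;

    if ($host === 'youtu.be') {
        // Short links carry the ID as the path: youtu.be/ID
        $id = trim($parts['path'] ?? '', '/');
    } elseif ($host === 'youtube.com' || $host === 'www.youtube.com') {
        // Watch links carry the ID in the "v" query parameter.
        parse_str($parts['query'] ?? '', $query);
        $id = $query['v'] ?? null;
    }

    // Assumption: IDs consist only of letters, digits, "_" and "-".
    if (!is_string($id) || preg_match('/^[\w-]+\z/', $id) !== 1) {
        return null;
    }
    return 'https://www.youtube.com/watch?v=' . $id;
}

$normalized = normalizeYoutubeUrl('http://youtu.be/7HknMcG2qYo');
```

Parsing with `parse_url` and `parse_str` rather than one large regex keeps extra query parameters such as `&t=10` out of the result.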

hanshenrik · Answer 2 · 2015-12-24T00:39:54.557

You can parse the HTML with DOMDocument and look for YouTube URLs with stripos, something like this:

$html = '<p>hello<a href="https://www.youtube.com/watch?v=7HknMcG2qYo">world</a></p><p>hello<a href="https://youtube.com/watch?v=37373o">world</a></p>';
// loadHTML() is an instance method; calling it statically fails on PHP 8.
$DOMD = new DOMDocument();
@$DOMD->loadHTML($html); // @ silences parser warnings about the HTML fragment

foreach ($DOMD->getElementsByTagName("a") as $url)
{
    $href = $url->getAttribute("href");
    // stripos() returning 0 means the href starts with the given prefix (case-insensitive).
    if (0 === stripos($href, "https://www.youtube.com/") || 0 === stripos($href, "https://www.youtu.be"))
    {
        $first_found_youtube_url = $href;
        break;
    }
}

Personally, I would probably use

"youtube.com"===parse_url($url->getAttribute("href"),PHP_URL_HOST)

instead, as it would match both http and https links. That is probably what you want, though strictly speaking it is not what the question asks for right now. Note that this exact comparison does not match www.youtube.com or youtu.be, so those hosts would need to be checked as well.
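
A host-based check along those lines might look like the sketch below. The helper name `firstYoutubeLink` and the list of accepted hosts are assumptions for illustration, not part of the original answer. It also uses `libxml_use_internal_errors` rather than `@` to keep parser warnings quiet.

```php
<?php
// Hypothetical helper: returns the href of the first <a> element whose
// host is a YouTube host, or null when no such link exists.
function firstYoutubeLink(string $html): ?string
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Assumed set of hosts to accept; extend as needed.
    $hosts = ['youtube.com', 'www.youtube.com', 'youtu.be'];

    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // parse_url returns null or false when no host can be extracted.
        $host = parse_url($href, PHP_URL_HOST);
        if (is_string($host) && in_array(strtolower($host), $hosts, true)) {
            return $href;
        }
    }
    return null;
}

$html = '<p>hello<a href="https://www.youtube.com/watch?v=7HknMcG2qYo">world</a></p>'
      . '<p>hello<a href="https://youtube.com/watch?v=37373o">world</a></p>';
$first_found_youtube_url = firstYoutubeLink($html);
```

Comparing the parsed host, instead of searching for a substring, means a link such as `https://example.com/?u=youtube.com` is not mistaken for a YouTube URL.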

score 0 · Answer 3 · answered Dec 24 '15 at 00:38

I think this will do what you are looking for. I have used preg_match_all simply because I find it easier to debug the regexes.

<?php

$html = '<p>hello<a href="https://www.youtu.be/watch?v=7HknMcG2qYo">world</a></p><p>hello<a href="https://youtube.com/watch?v=37373o">world</a></p>';

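// Host is "youtu.be" or "youtube.com"; the path may contain letters, digits, "?", "=", "_" and "-".
$pattern = '/https?:\/\/(www\.)?youtu(\.be|be\.com)\/[a-zA-Z0-9\?=_-]*/i';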
preg_match_all($pattern, $html, $matches);

// print_r($matches);
$first_found_youtube = $matches[0][0];
echo $first_found_youtube;

demo - https://3v4l.org/lFjmK
