
I'm trying to validate YouTube URLs for my application.

So far I have the following:

// Set the youtube URL
$youtube_url = "www.youtube.com/watch?v=vpfzjcCzdtCk";

if (preg_match("/((http\:\/\/){0,}(www\.){0,}(youtube\.com){1} || (youtu\.be){1}(\/watch\?v\=[^\s]){1})/", $youtube_url) == 1)
{
    echo "Valid";
}
else
{
    echo "Invalid";
}

I want to accept the following variations of YouTube URLs:

  • With and without http://
  • With and without www.
  • With the URLs youtube.com and youtu.be
  • Must have /watch?v=
  • Must have the unique video string (In the example above "vpfzjcCzdtCk")

However, I don't think my logic is right, because the pattern also returns true for www.youtube.co/watch?v=vpfzjcCzdtCk (note that I deliberately wrote .co instead of .com).

Luke
  • Possible duplicate of [Regular Expression Youtube URL](http://stackoverflow.com/questions/8306963/regular-expression-youtube-url) – Glenn Slayden Feb 01 '17 at 19:51

5 Answers


There are a lot of redundancies in this regular expression of yours (and also, the leaning toothpick syndrome). This, though, should produce results:

$rx = '~
  ^(?:https?://)?                           # Optional protocol
   (?:www[.])?                              # Optional sub-domain
   (?:youtube[.]com/watch[?]v=|youtu[.]be/) # Mandatory domain name (w/ query string in .com)
   ([^&]{11})                               # Video id of 11 characters as capture group 1
    ~x';

$has_match = preg_match($rx, $url, $matches);

// if matching succeeded, $matches[1] would contain the video ID

Some notes:

  • use the tilde character ~ as delimiter, to avoid LTS
  • use [.] instead of \. to improve visual legibility and avoid LTS. (Most "special" characters, such as the dot ., lose their special meaning inside a character class, i.e. within square brackets.)
  • to make regular expressions more "readable" you can use the x modifier (which has further implications; see the docs on Pattern modifiers), which also allows for comments in regular expressions
  • capturing can be suppressed using non-capturing groups: (?: <pattern> ). This spares the engine from recording sub-matches you don't need and keeps $matches uncluttered.
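
As a rough check of the pattern above, the sketch below wraps it in a small function and runs it against a few URLs. The function name youtube_id_from_pattern() is made up for illustration, the sample URLs are taken from or modelled on the ones on this page, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: applies the pattern from this answer and returns
// the captured video ID, or null when the URL does not match.
function youtube_id_from_pattern(string $url): ?string
{
    $rx = '~
      ^(?:https?://)?                           # Optional protocol
       (?:www[.])?                              # Optional sub-domain
       (?:youtube[.]com/watch[?]v=|youtu[.]be/) # Mandatory domain name
       ([^&]{11})                               # Video ID as capture group 1
        ~x';

    return preg_match($rx, $url, $matches) === 1 ? $matches[1] : null;
}

var_dump(youtube_id_from_pattern('https://youtu.be/U4JR1IuKkJM'));               // string(11) "U4JR1IuKkJM"
var_dump(youtube_id_from_pattern('http://www.youtube.com/watch?v=U4JR1IuKkJM')); // string(11) "U4JR1IuKkJM"
var_dump(youtube_id_from_pattern('www.youtube.co/watch?v=U4JR1IuKkJM'));         // NULL
```

One thing to be aware of: the pattern has no trailing anchor, so an ID longer than 11 characters is truncated rather than rejected. The 12-character ID from the question, vpfzjcCzdtCk, comes back as vpfzjcCzdtC.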

Optionally, to extract values from a (more or less complete) URL, you might want to make use of parse_url():

$url = 'http://youtube.com/watch?v=VIDEOID';
$parts = parse_url($url);
print_r($parts);

Output:

Array
(
    [scheme] => http
    [host] => youtube.com
    [path] => /watch
    [query] => v=VIDEOID
)

Validating the domain name and extracting the video ID is left as an exercise to the reader.
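
For readers who would rather not do the exercise, here is one possible sketch. It assumes only the two hosts from the question (youtube.com and youtu.be, each with an optional www. prefix) should be accepted; the function name youtube_id_from_url() is hypothetical, and the code assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the video ID, or null if the URL is not
// one of the two accepted shapes.
function youtube_id_from_url(string $url): ?string
{
    // parse_url() only recognises the host when a scheme is present,
    // so prepend one for inputs such as "www.youtube.com/watch?v=...".
    if (!preg_match('~^https?://~i', $url)) {
        $url = 'http://' . $url;
    }

    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null;
    }

    // Normalise the host: lower-case it and drop an optional "www." prefix.
    $host = strtolower($parts['host']);
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    $path = $parts['path'] ?? '';

    if ($host === 'youtube.com' && $path === '/watch') {
        // Long form: the ID is the "v" query parameter.
        parse_str($parts['query'] ?? '', $query);
        $id = $query['v'] ?? '';
    } elseif ($host === 'youtu.be') {
        // Short form: the ID is all there is in the path.
        $id = ltrim($path, '/');
    } else {
        return null;
    }

    return (is_string($id) && $id !== '') ? $id : null;
}

var_dump(youtube_id_from_url('www.youtube.com/watch?v=U4JR1IuKkJM')); // string(11) "U4JR1IuKkJM"
var_dump(youtube_id_from_url('https://youtu.be/U4JR1IuKkJM'));        // string(11) "U4JR1IuKkJM"
var_dump(youtube_id_from_url('www.youtube.co/watch?v=U4JR1IuKkJM'));  // NULL
```

This sketch does not check the shape of the ID itself; a regular expression for the ID can be applied to the returned value if stricter validation is wanted.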


I gave in to the comment war below; thanks to Toni Oriol, the regular expression now works on short (youtu.be) URLs as well.

Linus Kleen
  • Wow, that is an amazing answer. I think it's going to take some looking at the manual to explain exactly why it works, but it's simply great! Thanks! – Luke Nov 20 '12 at 15:25
  • How to validate youtube link with text box - http://stackoverflow.com/questions/28735459/how-to-validate-you-tube-url-in-client-side-in-text-box – Hitesh Feb 26 '15 at 05:59
  • doesn't verify the following URL youtu.be/pmpqdwvzzzm – Muaaz Khalid Apr 06 '16 at 16:53
  • @muaaz Indeed. Then again, this question is almost four years old. One should expect things to change within periods of this magnitude. – Linus Kleen Apr 06 '16 at 16:57
  • ohh sorry, I didn't even notice :) anyway, do you or anyone have a modified solution? – Muaaz Khalid Apr 06 '16 at 16:59
  • @muaaz That's what "part 2" of this answer is all about. The video ID should be the `path` part of the resulting array. Did you even read this answer? – Linus Kleen Apr 06 '16 at 19:30
  • No one cares if it's 4 years old, or if part 2 is about that (and no, part 2 is about extracting the values). It's simply wrong. I tried to edit to fix it but it seems that it disappeared mysteriously. – Toni Oriol Sep 29 '16 at 13:58
  • @ToniOriol Your tone... It could be friendlier, friend. Your edit hasn't _magically_ vanished; it simply [failed its review](http://stackoverflow.com/review/suggested-edits/13818967). And might I ask you the same question as asked before: _Did you even read this answer?_ It says: *The video ID should be the path part of the resulting array.* This case applies even more so when subjecting the `youtu.be/video_id` URL to `parse_url()`: the video ID is _all_ that's in "path". – Linus Kleen Sep 29 '16 at 14:31
  • Sorry @LinusKleen, I apologize if I offended you. But I could tell you the same: I think my comment is as offensive as you suggesting that @muaaz (and now me) didn't even read your answer. I'm sorry, but the answer (at least half of it) is wrong. And it's no excuse that the answer is old: the shortened version of the YouTube URL never included `/watch?v=XXXXXXXXXXX`; it has always been `youtu.be/XXXXXXXXXXX`. And please, don't get me wrong, I don't want to create a conflict with you here. – Toni Oriol Sep 29 '16 at 16:04
  • @ToniOriol agreed. Linus is acting confused. Stackoverflow is not "go f*** off and solve it yourself". This answer is wrong, it does not match `https://youtu.be/U4JR1IuKkJM`. Yes, that means i had to figure out myself that the answer is only half a**ed and incomplete, and for me to complete. OKAY LINUS, THANKS ANYWAY. – Toskan Feb 01 '17 at 02:02
  • oh and a solution is btw this `~ ^(?:https?://)? # Optional protocol (?:www\.)? # Optional subdomain (?:youtube\.com|youtu\.be) # Mandatory domain name (?:/watch\?v=([^&]+)|/([^\?]+)) # URI with video id as capture group 1 ~x` this will save in `matches[1]` (www version) or `matches[2]` (youtu.be version) – Toskan Feb 01 '17 at 02:14
  • That's right @Toskan, this is exactly what my suggested edit looks like :) – Toni Oriol Feb 01 '17 at 12:50
  • Nice @LinusKleen :) – Toni Oriol Feb 27 '17 at 16:15

An alternative to Regular Expressions would be parse_url().

 $parts = parse_url($url);
 if ($parts['host'] == 'youtube.com' && ...) {
   // your code
 }

While it is more code, it is more readable and therefore more maintainable.

Jason McCreary

Please try:

// Set the youtube URL
$youtube_url = "www.youtube.com/watch?v=vpfzjcCzdtCk";

// The domain alternation is grouped on its own, so /watch?v=<id> is required for both domains
if (preg_match("/^(http\:\/\/){0,}(www\.){0,}((youtube\.com){1}|(youtu\.be){1})(\/watch\?v\=[^\s]+){1}$/", $youtube_url) == 1)
{
    echo "Valid";
}
else
{
    echo "Invalid";
}

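You had ||, which leaves an empty alternative between the two pipes. An empty alternative matches the empty string, so without the ^ and $ anchors the pattern is satisfied by any input.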
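
The effect of a doubled pipe can be seen in isolation with two deliberately simplified patterns (these are not the full expression from the question, just a minimal illustration):

```php
<?php
// Two pipes in a row leave an empty branch between them.
$withEmptyAlternative = '/(youtube\.com||youtu\.be)/';
// A single pipe separates exactly two non-empty branches.
$singlePipe           = '/(youtube\.com|youtu\.be)/';

// The empty branch matches the empty string at offset 0 of any subject.
var_dump(preg_match($withEmptyAlternative, 'example.org')); // int(1)
var_dump(preg_match($singlePipe, 'example.org'));           // int(0)
```

This is why the mistyped www.youtube.co URL was reported as valid: with an empty branch and no anchors, every string is.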

eisberg

This should do it:

// [\w-]+ rather than \w+, because video IDs may contain hyphens as well as underscores
$valid = preg_match("/^(https?\:\/\/)?(www\.)?(youtube\.com|youtu\.be)\/watch\?v\=[\w-]+$/", $youtube_url);
if ($valid) {
    echo "Valid";
} else {
    echo "Invalid";
}
Steven Moseley

I defer to the other answers on this page for parsing the URL syntax, but for the YouTube ID values themselves, you can be a little bit more specific, as I describe in the following answer on StackExchange/WebApps:

Format for ID of YouTube video   -    https://webapps.stackexchange.com/a/101153/141734


Video Id

For the videoId, it is an 8-byte (64-bit) integer. Applying Base64-encoding to 8 bytes of data requires 11 characters. However, since each Base64 character conveys exactly 6 bits, this allocation could actually hold up to 11 × 6 = 66 bits--a surplus of 2 bits over what our payload needs. The excess bits are set to zero, which has the effect of excluding certain characters from ever appearing in the last position of the encoded string. In particular, the videoId will always end with one of the following:

{ A, E, I, M, Q, U, Y, c, g, k, o, s, w, 0, 4, 8 }

Thus, a regular expression (RegEx) for the videoId would be as follows:

[-_A-Za-z0-9]{10}[AEIMQUYcgkosw048]

Channel or Playlist Id

The channelId and playlistId strings are produced by Base64-encoding a 128-bit (16-byte) binary integer. Again, the Base64 calculation correctly predicts the observed string length of 22 characters. In this case, the output is capable of encoding 22 × 6 = 132 bits, a surplus of 4 bits; those zeros end up restricting most of the 64 alphabet symbols from appearing in the last position, and only 4 remain eligible. All channelId strings end in one of the following:

{ A, Q, g, w }

This gives us the regular expression for a channelId:

[-_A-Za-z0-9]{21}[AQgw]
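
The arithmetic above can be checked with a small sketch. It does not reproduce how YouTube generates its IDs; it only encodes random 8-byte and 16-byte values with the URL-safe Base64 alphabet (an assumption consistent with the - and _ in the expressions above) and tests them against the two patterns. The helper name base64url() is made up, and random_bytes() assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: URL-safe Base64 ("-" and "_" instead of "+" and "/")
// with the trailing "=" padding removed.
function base64url(string $bytes): string
{
    return rtrim(strtr(base64_encode($bytes), '+/', '-_'), '=');
}

$videoLike   = base64url(random_bytes(8));  // 64 bits of payload
$channelLike = base64url(random_bytes(16)); // 128 bits of payload

var_dump(strlen($videoLike));   // int(11)
var_dump(strlen($channelLike)); // int(22)

// The zeroed surplus bits restrict the last character as described above.
var_dump(preg_match('/^[-_A-Za-z0-9]{10}[AEIMQUYcgkosw048]$/', $videoLike)); // int(1)
var_dump(preg_match('/^[-_A-Za-z0-9]{21}[AQgw]$/', $channelLike));           // int(1)
```

Whatever random bytes are drawn, the lengths and the final-character sets come out the same, because they follow from the bit counts alone.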
Glenn Slayden
  • Thanks for adding this additional information Glenn! Therefore a more specific version of the regex would be https://regex101.com/r/pveXvY/1 – Luke Feb 02 '17 at 10:31