6

I'm writing a regex to find YouTube links (there can be several) in a piece of HTML text posted by a user.

Currently I'm using the following regex to replace 'http://www.youtube.com/watch?v=-JyZLS2IhkQ' with markup that displays the corresponding YouTube video:

return re.compile('(http(s|):\/\/|)(www.|)youtube.(com|nl)\/watch\?v\=([a-zA-Z0-9-_=]+)').sub(tag, value)

(where 'tag' is a snippet of HTML that embeds the video and 'value' is a user's post)

This works until the URL looks like this:

'http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature...'

I'm hoping you could help me figure out how to also match the '&feature...' part so that it disappears.

Example HTML:

No replies to this post..

Youtube vid:

http://www.youtube.com/watch?v=-JyZLS2IhkQ

More blabla

Thanks for your thoughts, much appreciated

Stefan

Frits
  • your regex is quite atrocious :) –  Jan 16 '11 at 15:12
  • wait what? are you trying to _find_ a youtube link buried in some html code? i had a hard time parsing that from your question! –  Jan 16 '11 at 15:23
  • I'm sorry for the bad question, I changed the post, hopefully it's more clear now. – Frits Jan 16 '11 at 16:37
  • About the atrocious regex, how to improve it? – Frits Jan 16 '11 at 16:39
  • your example is not really html and you don't tell us what can be expected from value. if value is user-supplied, you'll run into all kinds of trouble. –  Jan 16 '11 at 18:27
  • You should also account for a url like http://youtu.be/IytNBm8WA1c – Kenzic Sep 19 '12 at 19:55

4 Answers

6

Here's how I'm solving it:

import re

def youtube_url_validation(url):
    # Every fragment is a raw string so that \. and \? reach the regex engine unchanged.
    youtube_regex = (
        r'(https?://)?(www\.)?'
        r'(youtube|youtu|youtube-nocookie)\.(com|be)/'
        r'(watch\?v=|embed/|v/|.+\?v=)?([^&=%\?]{11})')

    # re.match returns a match object, or None when the URL does not match.
    return re.match(youtube_regex, url)

TESTS:

youtube_urls_test = [
    'http://www.youtube.com/watch?v=5Y6HSHwhVlY',
    'http://youtu.be/5Y6HSHwhVlY', 
    'http://www.youtube.com/embed/5Y6HSHwhVlY?rel=0" frameborder="0"',
    'https://www.youtube-nocookie.com/v/5Y6HSHwhVlY?version=3&hl=en_US',
    'http://www.youtube.com/',
    'http://www.youtube.com/?feature=ytca']


for url in youtube_urls_test:
    m = youtube_url_validation(url)
    if m:
        print('OK {}'.format(url))
        print(m.groups())
        print(m.group(6))
    else:
        print('FAIL {}'.format(url))
Moreno
  • To match URLs like `http://www.youtube.com/watch?feature=player_detailpage&v=QemTZn8YfJ0#t=46s` I edited your regex to `youtube_regex = ( r'(https?://)?(www\.)?' '(youtube|youtu|youtube-nocookie)\.(com|be)/' '(watch\?.*?(?=v=)v=|embed/|v/|.+\?v=)?([^&=%\?]{11})')` – Christoph Dwertmann May 27 '15 at 03:36
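A quick way to check the edited pattern from the comment above is to wrap it in a small helper. This is only a sketch: the names `YOUTUBE_RE` and `video_id` are made up for illustration, and every fragment is written as a raw string.

```python
import re

# Pattern from the comment above, with every fragment written as a raw string.
YOUTUBE_RE = re.compile(
    r'(https?://)?(www\.)?'
    r'(youtube|youtu|youtube-nocookie)\.(com|be)/'
    r'(watch\?.*?(?=v=)v=|embed/|v/|.+\?v=)?([^&=%\?]{11})')


def video_id(url):
    """Return the 11-character video id (group 6), or None if there is no match."""
    m = YOUTUBE_RE.match(url)
    return m.group(6) if m else None


print(video_id('http://www.youtube.com/watch?feature=player_detailpage&v=QemTZn8YfJ0#t=46s'))
print(video_id('http://youtu.be/5Y6HSHwhVlY'))
print(video_id('http://www.youtube.com/'))
```

The first two calls should print the ids and the last one `None`, since a bare `http://www.youtube.com/` carries no id.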
5

You should specify your regular expressions as raw strings.

You don't have to escape every character that looks special, just the ones which are.

Instead of specifying an empty branch ((foo|)) to make something optional, you can use ?.

If you want to include - in a character set, you have to escape it or put it right after the opening bracket (or right before the closing one).

You can use special character sets like \w (equals [a-zA-Z0-9_] for ASCII input; in Python 3 it also matches Unicode word characters unless re.ASCII is set) to shorten your regex.

r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([-\w]+)'
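As a rough illustration of the cleaned-up pattern in use, here is a minimal sketch. `TAG` is a hypothetical stand-in for the question's `tag` markup, and the sample posts are made up. The second call shows why the next step is needed: the `&feature=related` tail is left behind.

```python
import re

# Cleaned-up pattern: raw string, `?` for the optional parts, and the hyphen
# placed first in the character set so it is taken literally.
YOUTUBE_RE = re.compile(r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([-\w]+)')

# Hypothetical replacement standing in for the question's `tag`;
# group 4 holds the video id.
TAG = r'[video:\4]'

plain = 'Youtube vid: http://www.youtube.com/watch?v=-JyZLS2IhkQ More blabla'
extra = 'Youtube vid: http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature=related More blabla'

print(YOUTUBE_RE.sub(TAG, plain))
# Youtube vid: [video:-JyZLS2IhkQ] More blabla
print(YOUTUBE_RE.sub(TAG, extra))
# Youtube vid: [video:-JyZLS2IhkQ]&feature=related More blabla
```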

Now, in order to match the whole URL, you have to think about what can or cannot follow it in the input. Then you put that into a lookahead group (you don't want to consume it).

In this example I took everything except -, =, %, & and alphanumerical characters to end the URL (too lazy to think about it any harder).

Everything between the v-argument and the end of the URL is non-greedily consumed by .*?.

r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([\w-]+)(&.*?)?(?=[^-\w&=%])'

Still, I would not put too much faith into this general solution. User input is notoriously hard to parse robustly.
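A minimal sketch of the lookahead version at work. One change is my own assumption rather than part of the pattern above: the lookahead also accepts `$`, because as written it needs at least one character after the URL and so would not match a link at the very end of the input. The helper name `strip_extra_params` is hypothetical.

```python
import re

YOUTUBE_RE = re.compile(
    r'(https?://)?(www\.)?youtube\.(com|nl)/watch\?v=([\w-]+)'
    r'(&.*?)?'            # extra query parameters, consumed non-greedily
    r'(?=[^-\w&=%]|$)'    # stop before a non-URL character; `|$` is an added assumption
)


def strip_extra_params(text):
    """Cut each matched URL off right after the video id (group 4)."""
    return YOUTUBE_RE.sub(lambda m: m.group(0)[:m.end(4) - m.start()], text)


print(strip_extra_params(
    'Youtube vid: http://www.youtube.com/watch?v=-JyZLS2IhkQ&feature=related More blabla'))
# Youtube vid: http://www.youtube.com/watch?v=-JyZLS2IhkQ More blabla
```

Because the whole URL, tail included, is now part of the match, passing the question's `tag` to `sub` instead of the lambda would replace the `&feature...` part along with the rest.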

3

What if you used the urlparse module (urllib.parse in Python 3) to pick apart the YouTube address you find and put it back into the format you want? You could then simplify your regex so that it only finds the entire URL, and let urlparse do the heavy lifting of picking it apart for you.

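# Python 3 imports; in Python 2 these lived in the urlparse and urllib modules.
from urllib.parse import urlparse, parse_qs, urlunparse, urlencode

youtube_url = urlparse('http://www.youtube.com/watch?v=aFNzk7TVUeY&feature=grec_index')
params = parse_qs(youtube_url.query)
# Keep only the v parameter; parse_qs maps each key to a list of values.
new_params = {'v': params['v'][0]}

# No backslashes needed: lines continue implicitly inside the parentheses.
cleaned_youtube_url = urlunparse((youtube_url.scheme,
                                  youtube_url.netloc,
                                  youtube_url.path,
                                  '',
                                  urlencode(new_params),
                                  youtube_url.fragment))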

It's a bit more code, but it allows you to avoid regex madness.

And as hop said, you should use raw strings for the regex.
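To connect the two halves (finding, then parsing), here is a rough sketch that uses a deliberately loose regex to locate candidate URLs and hands each one to `urllib.parse` for cleaning. The names `CANDIDATE_RE`, `clean_watch_url` and `clean_post` are hypothetical, and the rule for where a URL ends (whitespace, quotes or angle brackets) is an assumption about the input.

```python
import re
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

# Loose on purpose: anything that looks like a watch URL, up to the next
# whitespace, quote or angle bracket (an assumption about the input).
CANDIDATE_RE = re.compile(r'(?:https?://)?(?:www\.)?youtube\.com/watch\?[^\s"\'<>]+')


def clean_watch_url(match):
    """Rebuild the matched URL with only its v= parameter."""
    url = match.group(0)
    # urlparse needs a scheme to recognise the host, so add one if it is missing.
    parts = urlparse(url if '://' in url else 'http://' + url)
    video_ids = parse_qs(parts.query).get('v')
    if not video_ids:
        return url  # no v= parameter: leave the text untouched
    query = urlencode({'v': video_ids[0]})
    return urlunparse((parts.scheme, parts.netloc, parts.path, '', query, parts.fragment))


def clean_post(text):
    return CANDIDATE_RE.sub(clean_watch_url, text)


print(clean_post('Youtube vid: http://www.youtube.com/watch?v=aFNzk7TVUeY&feature=grec_index More blabla'))
# Youtube vid: http://www.youtube.com/watch?v=aFNzk7TVUeY More blabla
```

Since `parse_qs` does not care about parameter order, this also handles URLs where `v` is not the first parameter.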

seggy
  • I gave (and deleted) the same answer, but the problem as asked is to actually _find_ the url, not parse it. –  Jan 17 '11 at 08:50
  • Well, he wants to do both. He wants to find the url and parse it (because he needs to get rid of part of the query string). My suggestion was to find the url first, which his regex already does according to the question. And then use some already existing code to pick it apart and strip out what he doesn't want or need. – seggy Jan 17 '11 at 23:44
0

Here's how I implemented it in my script:

import re

string = "Hey, check out this video: https://www.youtube.com/watch?v=bS5P_LAqiVg"

# With several groups in the pattern, findall returns one tuple of groups per match.
youtube = re.findall(r'(https?://)?(www\.)?((youtube\.(com))/watch\?v=([-\w]+)|youtu\.be/([-\w]+))', string)

if youtube:
    print(youtube)

That outputs:

[('https://', 'www.', 'youtube.com/watch?v=bS5P_LAqiVg', 'youtube.com', 'com', 'bS5P_LAqiVg', '')]

If you just wanted to grab the video id, for example, you would do:

video_id = [c for c in youtube[0] if c]  # Drop the empty strings from the first match's tuple
video_id = video_id[-1]  # The last non-empty group is the video id
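An alternative sketch that avoids filtering out empty groups: make the bookkeeping groups non-capturing and give the id a name, so both URL forms share one capture. The names `VIDEO_RE` and `find_video_ids` are hypothetical, and the second URL in the sample text is the youtu.be example from the comments on the question.

```python
import re

# Non-capturing groups for the prefix; one named group for the id,
# shared by the youtube.com/watch?v= and youtu.be/ forms.
VIDEO_RE = re.compile(
    r'(?:https?://)?(?:www\.)?'
    r'(?:youtube\.com/watch\?v=|youtu\.be/)'
    r'(?P<id>[-\w]+)'
)


def find_video_ids(text):
    """Return the id of every YouTube link found in text, in order."""
    return [m.group('id') for m in VIDEO_RE.finditer(text)]


text = ("Hey, check out this video: https://www.youtube.com/watch?v=bS5P_LAqiVg "
        "and this one: http://youtu.be/IytNBm8WA1c")
print(find_video_ids(text))
# ['bS5P_LAqiVg', 'IytNBm8WA1c']
```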
kn0wmad1c