3

How do I truncate the URLs below right after the domain (".com") using Python, i.e. keep only youtube.com?

    youtube.com/video/AiL6nL
    yahoo.com/video/Hhj9B2
    youtube.com/video/MpVHQ
    google.com/video/PGuTN
    youtube.com/video/VU34MI

Is it possible to truncate like this?

Brisi

6 Answers

6

Check out Python's urlparse library (urllib.parse in Python 3). It is part of the standard library, so nothing else needs to be installed.

So you could do the following:

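try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2
import re

# Two of the URLs from the question; none of them has a scheme.
url_list = ['youtube.com/video/AiL6nL', 'yahoo.com/video/Hhj9B2']

def check_and_add_http(url):
    # Return the URL unchanged if it starts with 'http://' or 'https://';
    # otherwise prepend 'http://' so that urlsplit() can find the netloc.
    http_regex = re.compile(r'^https?://')
    if http_regex.match(url):
        return url
    return 'http://' + url

for url in url_list:
    url = check_and_add_http(url)
    print(urlsplit(url)[1])  # index 1 is the netloc, e.g. 'youtube.com'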

You can read more about urlsplit() in the documentation, including the indexes if you want to read the other parts of the URL.
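As a small illustration of those indexes: the result of urlsplit() is a named tuple, so each part can be read either by position or by attribute name. The URL below, with its query string and fragment, is made up for the example.

```python
from urllib.parse import urlsplit  # Python 3; the module is urlparse in Python 2

# Hypothetical URL with every component present
parts = urlsplit("http://youtube.com/video/AiL6nL?t=10#top")

print(parts.scheme)    # same as parts[0]
print(parts.netloc)    # same as parts[1]
print(parts.path)      # same as parts[2]
print(parts.query)     # same as parts[3]
print(parts.fragment)  # same as parts[4]
```

Reading parts.netloc is usually clearer than parts[1] when only the domain is needed.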

Ewan
  • 3
    Does it really work even without scheme part? I get empty strings. – alecxe Jun 07 '13 at 11:56
  • `from urlparse import urlparse; url = urlparse('http://www.youtube.com/video/wpmkqYRfVkk'); print "url = " + str(url)` – Brisi Jun 07 '13 at 12:16
  • @alecxe: indeed, `urlsplit()` doesn't work in this case (because `http://` part is missing in the input): `urlsplit("youtube.com/video/AiL6nL")` -> `SplitResult(scheme='', netloc='', path='youtube.com/video/AiL6nL', query='', fragment='')` – jfs Jun 09 '13 at 01:23
  • Updated to check for the scheme and add `http://` if it is missing, to make parsing easier – Ewan Jun 09 '13 at 06:47
4

You can use split():

myUrl.split(r"/")[0]

to get "youtube.com"

and:

myUrl.split(r"/", 1)[1]

to get everything else
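Applied to the whole list from the question, this might look like the sketch below. The names urls and domains are just for illustration, and it assumes none of the strings start with a scheme: with "http://youtube.com/..." the first piece would be "http:" rather than the domain.

```python
# URLs from the question, none of which has a scheme
urls = [
    "youtube.com/video/AiL6nL",
    "yahoo.com/video/Hhj9B2",
    "youtube.com/video/MpVHQ",
    "google.com/video/PGuTN",
    "youtube.com/video/VU34MI",
]

# maxsplit=1 stops after the first "/", so the rest of the path is not split
domains = [u.split("/", 1)[0] for u in urls]

for d in domains:
    print(d)
```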

mishik
1

I'd use the function urlsplit from the standard library:

try:
    from urllib.parse import urlsplit  # python 3
except ImportError:
    from urlparse import urlsplit  # python 2

myurl = "http://docs.python.org/2/library/urlparse.html"
urlsplit(myurl)[1] # returns 'docs.python.org'
ojdo
0

No library function can tell that those strings are supposed to be absolute URLs, since, formally, they are relative ones. So, you have to prepend //.

>>> from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit
>>> url = 'youtube.com/bla/foo'
>>> urlsplit('//' + url)[1]
'youtube.com'
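If the input may be a mix of URLs with and without a scheme, the prefix can be added only when needed. This is a rough sketch; netloc_of is a made-up helper name, and the check assumes "://" only ever appears as the scheme separator (a URL carrying another URL in its query string would defeat it).

```python
from urllib.parse import urlsplit  # Python 3; the module is urlparse in Python 2

def netloc_of(url):
    """Return the network location of url, with or without a scheme."""
    # Assumption: "://" occurs only right after a scheme such as "http".
    if "://" not in url and not url.startswith("//"):
        url = "//" + url  # makes urlsplit() read the first segment as the netloc
    return urlsplit(url).netloc

print(netloc_of("youtube.com/bla/foo"))
print(netloc_of("https://youtube.com/bla/foo"))
```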
kirelagin
0

Just a crazy alternative solution using tldextract:

>>> import tldextract
>>> ext = tldextract.extract('youtube.com/video/AiL6nL')
>>> ".".join(ext[1:3])
'youtube.com'
alecxe
0

For your particular input, you could use str.partition() or str.split():

print('youtube.com/video/AiL6nL'.partition('/')[0])
# -> youtube.com

Note: the urlparse module (which you could use to parse a URL in general) doesn't work in this case, because the input has no scheme and the whole string ends up in the path:

import urlparse

urlparse.urlsplit('youtube.com/video/AiL6nL')
# -> SplitResult(scheme='', netloc='', path='youtube.com/video/AiL6nL',
#                query='', fragment='')

More generally, a regex is safe here if you know that every line starts with a hostname and is otherwise a well-formed URI:

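import re

# text holds the URLs from the question, one per line
text = "youtube.com/video/AiL6nL\nyahoo.com/video/Hhj9B2\nyoutube.com/video/MpVHQ\ngoogle.com/video/PGuTN\nyoutube.com/video/VU34MI"

print("\n".join(re.findall(r"(?m)^\s*([^\/?#]*)", text)))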

Output

youtube.com
yahoo.com
youtube.com
google.com
youtube.com

Note: this doesn't remove the optional port, so an input such as host:port/path yields host:port.
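If the port does need to go, one option is to hand the host:port string back to urlsplit() and read its hostname attribute. This is only a sketch; host_without_port is a hypothetical helper name, and note that hostname also lowercases the result.

```python
from urllib.parse import urlsplit  # Python 3; the module is urlparse in Python 2

def host_without_port(host_port):
    """Strip an optional ':port' suffix from a 'host' or 'host:port' string."""
    # Prepending '//' makes urlsplit() treat the text as a network location.
    return urlsplit("//" + host_port).hostname

print(host_without_port("youtube.com:8080"))
print(host_without_port("youtube.com"))
```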

jfs