How do I truncate a URL using Python?

Question

How do I truncate the URLs below right after the domain ("com") using Python, i.e. keep youtube.com only?

    youtube.com/video/AiL6nL
    yahoo.com/video/Hhj9B2
    youtube.com/video/MpVHQ
    google.com/video/PGuTN
    youtube.com/video/VU34MI

Is it possible to truncate like this?

Ewan · Accepted Answer · 2013-06-09T06:36:58.770

6

Check out Python's urlparse module (urllib.parse in Python 3). It is part of the standard library, so nothing else needs to be installed.

So you could do the following:

import re
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit

def check_and_add_http(url):
    # Checks whether 'http://' or 'https://' is present at the start of the
    # URL and adds 'http://' if not, so that urlsplit() can find the host.
    http_regex = re.compile(r'^https?://')
    if http_regex.match(url):
        return url
    return 'http://' + url

# Sample input taken from the question.
url_list = ['youtube.com/video/AiL6nL', 'yahoo.com/video/Hhj9B2']

for url in url_list:
    url = check_and_add_http(url)
    print(urlsplit(url)[1])  # index 1 is the network location (host)

You can read more about urlsplit() in the documentation, including the indexes if you want to read the other parts of the URL.
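As a rough illustration of those other parts, assuming Python 3 (where the module is `urllib.parse`), the result of `urlsplit()` can also be read by attribute name instead of by index. The URL below is made up purely to show every component:

```python
from urllib.parse import urlsplit

# Hypothetical URL, chosen only so that every component is non-empty.
parts = urlsplit('http://youtube.com:8080/video/AiL6nL?t=10#top')

print(parts.scheme)    # same as parts[0]
print(parts.netloc)    # same as parts[1]; includes the port if one is given
print(parts.path)      # same as parts[2]
print(parts.query)     # same as parts[3]
print(parts.fragment)  # same as parts[4]
print(parts.hostname)  # host only, lowercased, without the port
```

If the input may carry a port, `hostname` is probably closer to what the question asks for than `netloc`.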

edited Jun 09 '13 at 06:36

answered Jun 07 '13 at 11:53

Ewan

Does it really work even without scheme part? I get empty strings. – alecxe Jun 07 '13 at 11:56
`from urlparse import urlparse; url = urlparse('http://www.youtube.com/video/wpmkqYRfVkk'); print "url = " + str(url)` – Brisi Jun 07 '13 at 12:16
@alecxe: indeed, `urlsplit()` doesn't work in this case (because `http://` part is missing in the input): `urlsplit("youtube.com/video/AiL6nL")` -> `SplitResult(scheme='', netloc='', path='youtube.com/video/AiL6nL', query='', fragment='')` – jfs Jun 09 '13 at 01:23
Updated to check for a scheme and add `http://` if it is not present, to make parsing easier – Ewan Jun 09 '13 at 06:47

score 4 · Answer 2 · answered Jun 07 '13 at 11:51

You can use split():

myUrl.split(r"/")[0]

to get "youtube.com"

and:

myUrl.split(r"/", 1)[1]

to get everything else
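One caveat, sketched here under the assumption that some inputs may contain no slash at all: `split("/", 1)[1]` raises `IndexError` for a bare `youtube.com`, while `str.partition()` always returns three items. The helper name `split_host` is hypothetical:

```python
def split_host(url):
    """Return (host, rest) for a scheme-less URL such as 'youtube.com/video/x'."""
    # partition() always yields (head, separator, tail), so a URL
    # without any '/' simply gets an empty tail instead of an error.
    host, _sep, rest = url.partition('/')
    return host, rest

print(split_host('youtube.com/video/AiL6nL'))
print(split_host('youtube.com'))
```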

answered Jun 07 '13 at 11:51

mishik

you could use `.partition('/')[0]` – jfs Jun 09 '13 at 01:20

score 1 · Answer 3 · answered Jun 07 '13 at 12:09

I'd use the function urlsplit from the standard library:

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2

myurl = "http://docs.python.org/2/library/urlparse.html"
print(urlsplit(myurl)[1])  # prints docs.python.org

answered Jun 07 '13 at 12:09

ojdo

score 0 · Answer 4 · answered Jun 07 '13 at 12:01

No library function can tell that those strings are supposed to be absolute URLs, since, formally, they are relative ones. So, you have to prepend //.

>>> import urlparse  # Python 3: import urllib.parse as urlparse
>>> url = 'youtube.com/bla/foo'
>>> urlparse.urlsplit('//' + url)[1]
'youtube.com'
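A minimal Python 3 sketch of the same idea, for input that may or may not already carry a scheme. The helper name `netloc_of` and the scheme pattern are assumptions made for illustration, not part of the standard library:

```python
import re
from urllib.parse import urlsplit

# Matches a leading 'scheme://' or a bare '//' (protocol-relative URL).
_HAS_AUTHORITY = re.compile(r'^(?:[a-zA-Z][a-zA-Z0-9+.-]*:)?//')

def netloc_of(url):
    """Return the network location of url, with or without a scheme."""
    # Assumption: anything that does not start with 'scheme://' or '//'
    # is a bare 'host/path' string, so '//' is prepended before parsing.
    if not _HAS_AUTHORITY.match(url):
        url = '//' + url
    return urlsplit(url).netloc

for url in ['youtube.com/video/AiL6nL', 'http://yahoo.com/video/Hhj9B2']:
    print(netloc_of(url))
```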

score 0 · Answer 5 · answered Jun 07 '13 at 12:04

Just a crazy alternative solution using tldextract:

>>> import tldextract
>>> ext = tldextract.extract('youtube.com/video/AiL6nL')
>>> ".".join(ext[1:3])
'youtube.com'

answered Jun 07 '13 at 12:04

alecxe

score 0 · Answer 6 · edited Oct 07 '21 at 06:16

For your particular input, you could use str.partition() or str.split():

print('youtube.com/video/AiL6nL'.partition('/')[0])
# -> youtube.com

Note: the urlparse module (which you could use in general to parse a URL) doesn't work in this case, because the input has no scheme and the whole string ends up in the path:

import urlparse

urlparse.urlsplit('youtube.com/video/AiL6nL')
# -> SplitResult(scheme='', netloc='', path='youtube.com/video/AiL6nL',
#                query='', fragment='')

It is safe to use a regex here if you know that every line starts with a hostname and is otherwise a well-formed URI:

import re

# text is assumed to hold the input as one string, one URL per line
print("\n".join(re.findall(r"(?m)^\s*([^\/?#]*)", text)))

Output

youtube.com
yahoo.com
youtube.com
google.com
youtube.com

Note: this doesn't remove the optional port part, so an input of the form host:port/path yields host:port.
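If the port should go as well, one possible variation is to also exclude `:` and whitespace from the character class. This is only a sketch: the input below is hypothetical (the second line has a port added to show the effect), and the pattern assumes there are no IPv6 literals such as `[::1]`:

```python
import re

# Hypothetical input; the port on the second line is added for illustration.
text = """youtube.com/video/AiL6nL
yahoo.com:8080/video/Hhj9B2
google.com/video/PGuTN"""

# Stop at '/', '?', '#', ':' or whitespace, so the match ends before a port.
# Using '+' instead of '*' avoids empty matches on blank lines.
hosts = re.findall(r"(?m)^\s*([^/?#:\s]+)", text)
print("\n".join(hosts))
```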
