How to extract just domain names from urls?

Question

I have the following list of URLs:

urls = ["http://arxiv.org/pdf/1611.08097", "https://doi.org/10.1109/tkde.2016.2598561", "https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85116544648&origin=inward"]

From each element of the list, I am trying to extract just the domain name, e.g. arxiv, doi, scopus.

For that I have this code:

import re

for url in urls:
    print(re.search('https?://([A-Za-z_0-9.-]+).*', url).group(1))

The output of print:

arxiv.org
doi.org
www.scopus.com

How can I modify the regex above to extract just the domain name, without the other parts such as www., .com, .org, etc.?

Thanks in advance.

Ok, I [modified your regex](https://stackoverflow.com/a/70188031/16343464) — mozway, Dec 01 '21 at 16:42

score 2 · Accepted Answer · answered Dec 01 '21 at 16:29


You can remove the dot from the character class and make www. optional. The value is in capture group 1.

https?://(?:www\.)?([A-Za-z_0-9-]+)
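As a minimal sketch of how this pattern could be applied to the list from the question: the helper name `first_label` is made up for illustration and is not part of the answer, and the `None` fallback for non-matching input is an assumption about what you would want there.

```python
import re

urls = [
    "http://arxiv.org/pdf/1611.08097",
    "https://doi.org/10.1109/tkde.2016.2598561",
    "https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85116544648&origin=inward",
]

# Optional "www." prefix, then capture everything up to the first dot.
pattern = re.compile(r"https?://(?:www\.)?([A-Za-z_0-9-]+)")

def first_label(url):
    """Hypothetical helper: return capture group 1, or None if there is no match."""
    match = pattern.search(url)
    return match.group(1) if match else None

names = [first_label(url) for url in urls]
print(names)  # ['arxiv', 'doi', 'scopus']
```

Note that only a leading www. is skipped. For a URL with a different subdomain, such as https://export.arxiv.org/, this pattern would capture export rather than arxiv.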


The fourth bird


mozway · Answer 2 · 2021-12-01T16:47:47.353

To get only the second-to-last chunk of the hostname (the part just before the final suffix), you could modify your regex to:

[re.search(r'https?://(?:[^/]+\.)?([A-Za-z_0-9-]+)\.[^/.]+(?:/.*)?', url).group(1)
 for url in urls]

Output:

['arxiv', 'doi', 'scopus']

urllib

@AbdulNiyasPM had a nice answer that has since been deleted. You can adapt it to get what you want:

from urllib.parse import urlparse
[urlparse(url).hostname.split('.')[-2]
 for url in urls]
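If the list might contain entries without a scheme or hosts without a dot, the one-liner above would raise an exception. Here is a slightly more defensive sketch of the same idea. The function name `second_level_label` and its fallback behaviour are assumptions made for illustration, not part of the original answer.

```python
from urllib.parse import urlparse

urls = [
    "http://arxiv.org/pdf/1611.08097",
    "https://doi.org/10.1109/tkde.2016.2598561",
    "https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85116544648&origin=inward",
]

def second_level_label(url):
    """Hypothetical helper: second-to-last dot-separated label of the hostname."""
    hostname = urlparse(url).hostname
    if not hostname:
        # No scheme or no host, e.g. "arxiv.org/pdf": urlparse finds no hostname.
        return None
    parts = hostname.split(".")
    # Single-label hosts such as "localhost" have no second-to-last part.
    return parts[-2] if len(parts) >= 2 else parts[0]

labels = [second_level_label(url) for url in urls]
print(labels)  # ['arxiv', 'doi', 'scopus']
```

Taking the second-to-last label works for the URLs in the question, but not for multi-part suffixes: for https://www.bbc.co.uk/ it would return co. Handling those cases reliably would likely require a public suffix list, which is outside what this sketch covers.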
