I have the following list of URLs:

urls = ["http://arxiv.org/pdf/1611.08097", "https://doi.org/10.1109/tkde.2016.2598561", "https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85116544648&origin=inward"]

From each element of the list, I am trying to extract just the domain name, e.g. arxiv, doi, scopus.

For that I have this code:

import re

for url in urls:
    print(re.search('https?://([A-Za-z_0-9.-]+).*', url).group(1))

This prints:

arxiv.org
doi.org
www.scopus.com

How can I modify the above regex to extract just the domain name, without the other parts such as www., .com, .org, etc.?

Thanks in advance.

reinhardt

2 Answers

You can remove the dot from the character class and make www. optional. The value is in capture group 1.

https?://(?:www\.)?([A-Za-z_0-9-]+)
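A minimal sketch of how this pattern could be applied to the list from the question, assuming Python's `re` module. The helper name `site_name` is made up for illustration. Note that only a leading `www.` is skipped, so any other subdomain would be captured instead of the name after it.

```python
import re

urls = [
    "http://arxiv.org/pdf/1611.08097",
    "https://doi.org/10.1109/tkde.2016.2598561",
    "https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85116544648&origin=inward",
]

# Skip the scheme and an optional "www.", then capture everything up to the first dot.
PATTERN = re.compile(r"https?://(?:www\.)?([A-Za-z_0-9-]+)")


def site_name(url):
    """Hypothetical helper: return capture group 1, or None if the URL does not match."""
    match = PATTERN.match(url)
    return match.group(1) if match else None


names = [site_name(url) for url in urls]
print(names)  # ['arxiv', 'doi', 'scopus']
```

Returning `None` on a failed match avoids the `AttributeError` that calling `.group(1)` directly on the result of `re.search` would raise for a non-matching string.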

The fourth bird

To get only the second-to-last part of the domain, you could modify your regex as follows:

[re.search(r'https?://(?:[^/]+\.)?([A-Za-z_0-9-]+)\.[^/.]+(?:/.*)?', url).group(1)
 for url in urls]

Output:

['arxiv', 'doi', 'scopus']
urllib

@AbdulNiyasPM had a nice answer that has since been deleted. You can adapt it to get what you want:

from urllib.parse import urlparse
[urlparse(url).hostname.split('.')[-2]
 for url in urls]
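A slightly more defensive sketch of the same idea, in case some inputs are not well-formed URLs. The helper name `second_level_label` is hypothetical. It still simply takes the second-to-last dot-separated label, so a host such as `www.bbc.co.uk` would yield `co`, not `bbc`.

```python
from urllib.parse import urlparse

urls = [
    "http://arxiv.org/pdf/1611.08097",
    "https://doi.org/10.1109/tkde.2016.2598561",
    "https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85116544648&origin=inward",
]


def second_level_label(url):
    """Hypothetical helper: return the label just before the last dot of the hostname."""
    hostname = urlparse(url).hostname  # None when the string has no network location
    if not hostname:
        return None
    labels = hostname.split(".")
    # Single-label hosts such as "localhost" have no second-to-last label.
    return labels[-2] if len(labels) >= 2 else None


result = [second_level_label(url) for url in urls]
print(result)  # ['arxiv', 'doi', 'scopus']
```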
mozway