How can I extract the website name from strings like these in Python?

Question

I am pretty new to Python and I have the following problem.

I have a variable named target_url containing a string that represents a URL, so it contains something like:

http_s://stackoverflow.com

or: htt_p://myhttpurl.com

or:

http_s://www.mysite.com

(I inserted the _s and the _p in the URLs because Stack Overflow gives me an error otherwise.)

From a string like this I want to extract only the site name, so for the examples above, these substrings:

stackoverflow.com
myhttpurl.com
mysite.com

What could be a smart way to achieve this task?

Have a look into regex, via the built-in ‘re’ Python library. — S3DEV, Apr 28 '20 at 19:05
Try this [regex](https://stackoverflow.com/a/31952097/7695722) example. It is very comprehensive. — Brayoni, Apr 28 '20 at 19:12

score 1 · Answer 1 · answered Apr 28 '20 at 19:02

s = 'http_s://stackoverflow.com'
s.split("//")[-1]
#'stackoverflow.com'

SuperCiocia

score 1 · Answer 2 · answered Apr 28 '20 at 19:03

You can use the str.split() method like this:

'http_s://stackoverflow.com'.split('//')[-1]

Output:

'stackoverflow.com'
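Note that for the third example in the question, splitting on '//' alone returns 'www.mysite.com', while the question asks for 'mysite.com'. A minimal sketch of one way to extend the split approach follows. The helper name site_name is hypothetical, the https:// and http:// schemes stand in for the question's obfuscated ones, and it assumes that only a leading 'www.' and any trailing path need to be dropped:

```python
def site_name(target_url):
    """Return the host part of a URL-like string, without a leading 'www.'."""
    # Keep what follows the last '//' (the whole string if there is no '//')
    host = target_url.split("//")[-1]
    # Assumption: drop any path, e.g. 'mysite.com/page' -> 'mysite.com'
    host = host.split("/")[0]
    # Remove a leading 'www.' if present
    if host.startswith("www."):
        host = host[len("www."):]
    return host


for url in ["https://stackoverflow.com", "http://myhttpurl.com", "https://www.mysite.com"]:
    print(site_name(url))
```

This is only a string-level shortcut; it does not validate the URL, and other subdomains (for example 'blog.mysite.com') are left unchanged.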

Dhaval Taunk

score 1 · Answer 3 · answered Apr 28 '20 at 19:29

Use regular expressions to capture the information you want. Depending on how the data is stored and how the task fits into the larger workflow, regular expressions can be applied in a few different ways (I can look into this further if needed).

To start out with, we’ll build a pattern that matches the string you’re looking for and extracts the section you want.

# regular expression library
import re

# expression pattern as p: "http", an optional "s", then capture everything up to ".com"
p = r'https?://(.+\.com)'

# input string as s
s = 'https://www.stackoverflow.com'

# search once, then print the text captured within the parentheses if there is a match
m = re.search(p, s)
if m is not None:
    print(m.group(1))

Returns:

www.stackoverflow.com

A couple notes on this code:

This code uses re.search, which scans the input string for the first location where the pattern matches and returns a match object (or None if there is no match). If we needed every match of one pattern in a string, we would use a different re function, such as re.findall or re.finditer.

The capture happens in two parts: first, the parentheses in the pattern form a capture group. Second, the text matched by that group is returned by calling .group(1) on the match object (m above). Calling m.group(0) instead returns the entire matched string.
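To illustrate both notes, here is a short sketch rather than a definitive pattern. The regular expression below is my own assumption, not the one from the answer's code: it skips an optional leading 'www.' with a non-capturing group and captures up to the next '/' or whitespace, so it is not tied to '.com'. The sample text is made up for the example.

```python
import re

# Hypothetical pattern: optional 's', optional 'www.', then capture
# everything up to the next '/' or whitespace character
pattern = re.compile(r"https?://(?:www\.)?([^/\s]+)")

m = pattern.search("https://www.mysite.com")
if m is not None:
    print(m.group(0))  # entire match: https://www.mysite.com
    print(m.group(1))  # capture group only: mysite.com

# re.finditer yields one match object per URL found in a longer string
text = "See https://stackoverflow.com and http://myhttpurl.com/about for details"
sites = [match.group(1) for match in pattern.finditer(text)]
print(sites)  # ['stackoverflow.com', 'myhttpurl.com']
```

The non-capturing group (?:...) lets 'www.' be matched without becoming part of group(1), which is what produces the output format the question asks for.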

Let me know if this works, and we can look at implementation if needed. Hope this helps!
