How to extract a string that always comes after a specific string and is optionally followed by another string

Question

I have a string that is always preceded by http:// and optionally followed by /. Example:

http://www.mymovies.com/

But sometimes it can be in the format: http://www.mymovies.com

I want to extract www.mymovies.com, and I want to capture both formats (with or without the /).

I tried using:

import re
print(re.search('http://(.*)/','http://www.mymovies.com').group(1))

But I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

1) How do I solve the error? 2) How do I capture both forms, with and without the trailing / character (my solution requires the /)?

It is not always `www.` my fixed character is `http://` as I stated in the question. — user9371654, Feb 28 '19 at 20:33
Try `http://([^/]*)/?`, see [this regex demo](https://regex101.com/r/aKWdre/1) — Wiktor Stribiżew, Feb 28 '19 at 20:39
@Wiktor Stribiżew How to try it? using re? Can you write the full line plz. — user9371654, Feb 28 '19 at 20:42
Yes, `re` is enough. `print(re.search(r'http://([^/]*)/?','http://www.mymovies.com').group(1))` and `print(re.search(r'http://([^/]*)/?','http://www.mymovies.com/').group(1))`. I do not know what other types of URLs you want to match, thus, it is a suggestion. — Wiktor Stribiżew, Feb 28 '19 at 20:42
See [What exactly is a “raw string regex” and how can you use it?](https://stackoverflow.com/questions/12871066/what-exactly-is-a-raw-string-regex-and-how-can-you-use-it). In such a string literal, all ``\`` are treated as literal backslashes. [Further reading](https://stackoverflow.com/questions/28334871/why-do-python-regex-strings-sometimes-work-without-using-raw-strings) if you are intrigued. Well, consider `r` a best practice when defining regex patterns in Python. — Wiktor Stribiżew, Feb 28 '19 at 20:44
Do you have any more test cases? Any set of rules of what kind of input the regex should match or avoid matching? — Wiktor Stribiżew, Feb 28 '19 at 20:48
@Wiktor Stribiżew yours is correct answer. The error appears because the string starts with `http://` and my actual code was `https://`. To capture both `http` and `https://` I did it like this: `http(s?)://([^/]*)/?` as some strings may have `http://` while others may have `https://` — user9371654, Feb 28 '19 at 20:57
See [my answer below](https://stackoverflow.com/a/54934446/3832970). — Wiktor Stribiżew, Feb 28 '19 at 21:23

score 1 · Accepted Answer · answered Feb 28 '19 at 21:19

You may use

import re

m = re.search(r'https?://([^/]*)/?','http://www.mymovies.com')
if m:
    print(m.group(1))

See the regex demo

Details

http - http substring
s? - 1 or 0 s chars
:// - a :// substring
([^/]*) - Capturing group 1: zero or more chars other than /
/? - 1 or 0 / chars.
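
As a rough illustration of the breakdown above, the same pattern can be written with `re.VERBOSE`, which lets each piece carry its own comment. This is only a sketch; the name `HOST_RE` is made up for this example and does not appear in the original answer.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
# In verbose mode, whitespace and "#" comments inside the pattern are ignored.
HOST_RE = re.compile(r"""
    http        # the literal text 'http'
    s?          # an optional 's'
    ://         # the literal text '://'
    ([^/]*)     # group 1: zero or more characters other than '/'
    /?          # an optional trailing '/'
    """, re.VERBOSE)

m = HOST_RE.search("https://www.mymovies.com/")
if m:
    print(m.group(1))
```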

Python demo (prints www.mymovies.com four times):

import re
strs = ['http://www.mymovies.com/','http://www.mymovies.com','https://www.mymovies.com/','https://www.mymovies.com']
r = re.compile(r'https?://([^/]*)/?')
for s in strs:
    m = r.search(s)
    if m:
        print(m.group(1))

Wiktor Stribiżew

Do you mean http - https substring (you forgot 's') – user9371654 Mar 02 '19 at 19:51
@user9371654 I have not forgotten `s`, see `https?` in the pattern: it matches `http` or `https` since `s?` matches an optional `s`, i.e. 1 or 0 `s` chars. – Wiktor Stribiżew Mar 02 '19 at 20:49
can u double check the first bullet point? this is what I mean. – user9371654 Mar 02 '19 at 21:38
@user9371654 It is fine, since it is followed with the second bullet point. Please see the pattern in its entirety. – Wiktor Stribiżew Mar 02 '19 at 21:39
Now I got what you mean. Sorry. – user9371654 Mar 02 '19 at 21:41

Krateng · Answer 2 · 2019-02-28T21:00:15.267

Your search string is http://(.*)/, so the / at the end is obligatory. If you put a ? after it you make it optional, or you can just leave it out completely. If you don't want it to be part of the resulting string, either restrict the matched characters before it to everything but /:

http://([^/]*)

or do a simple last-character-check after the operation and remove it if it's a /:

if result.endswith("/"): result = result[:-1]

It should also be noted that if your input can be full URLs (including paths and ?key=value pairs), you should restrict the matched characters further.
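
One way to read "restrict the matched characters further" is to also exclude `?` and `#` from the character class, so the match stops at a query string or fragment even when no `/` comes first. This is a sketch under that assumption; the name `HOST_ONLY` and the sample URLs are invented for illustration, and the pattern still does not strip ports or user info.

```python
import re

# Stop at the first "/", "?" or "#" after the scheme.
# Inside a character class, "?" and "#" are plain literal characters.
HOST_ONLY = re.compile(r'https?://([^/?#]*)')

urls = [
    'http://www.mymovies.com/star-wars?episode=5',
    'http://www.mymovies.com?key=value',
    'https://www.mymovies.com#top',
]
for url in urls:
    m = HOST_ONLY.search(url)
    if m:
        print(m.group(1))
```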

score 0 · Answer 3 · answered Feb 28 '19 at 20:40

Try Regex: (?<=http:\/\/)[^\/]+?(?=\/|$)

Demo
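
For reference, this is roughly how the lookaround pattern could be used from Python. Because the lookbehind and lookahead consume no characters, the host is the whole match (`group(0)`) and no capturing group is needed. The slashes are left unescaped here since `/` has no special meaning in Python's `re`. Note that Python's `re` module requires fixed-width lookbehinds, so `(?<=https?://)` would not compile; handling both schemes this way would need a different approach.

```python
import re

# The lookbehind requires "http://" just before the match;
# the lookahead requires "/" or the end of the string just after it.
pattern = re.compile(r'(?<=http://)[^/]+?(?=/|$)')

for url in ['http://www.mymovies.com/', 'http://www.mymovies.com']:
    m = pattern.search(url)
    if m:
        print(m.group(0))
```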

Matt.G

score 0 · Answer 4 · answered Feb 28 '19 at 23:49

You could do it without regular expressions using the split() method:

url.split("/")[2]

'http://www.mymovies.com/'.split("/")[2] ==> "www.mymovies.com"

'http://www.mymovies.com'.split("/")[2] ==> "www.mymovies.com"

'http://www.mymovies.com/star-wars/episodeV'.split("/")[2] ==> "www.mymovies.com"
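
One caveat with indexing `[2]` directly is that an input without `//` raises `IndexError`. A guarded version might look like the sketch below; `host_by_split` is a hypothetical helper name. The last line shows `urllib.parse.urlsplit` from the standard library as an alternative technique, not something this answer proposed. Its `netloc` attribute also includes any port or user info present in the URL.

```python
from urllib.parse import urlsplit

def host_by_split(url):
    """Hypothetical helper: return the third '/'-separated piece, or None."""
    parts = url.split("/")
    # 'http://host/path' splits into ['http:', '', 'host', 'path']
    if len(parts) > 2 and parts[1] == "":
        return parts[2]
    return None

print(host_by_split('http://www.mymovies.com/star-wars/episodeV'))
print(host_by_split('www.mymovies.com'))  # None instead of IndexError
print(urlsplit('http://www.mymovies.com/star-wars/episodeV').netloc)
```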
