I have a string that is always preceded by http:// and optionally followed by /. Example:

http://www.mymovies.com/

But sometimes it is in the format: http://www.mymovies.com

I want to extract www.mymovies.com, and I want to handle both formats (with and without the trailing /).

I tried using:

import re
print(re.search('http://(.*)/','http://www.mymovies.com').group(1))

But I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

1) How do I solve the error? 2) How do I capture the string both with and without the trailing / character (my current pattern requires the /)?

user9371654
  • `re.search('www.+com',s).group()` – Igor Dragushhak Feb 28 '19 at 20:31
  • It is not always `www.`; my fixed prefix is `http://`, as I stated in the question. – user9371654 Feb 28 '19 at 20:33
  • Try `http://([^/]*)/?`, see [this regex demo](https://regex101.com/r/aKWdre/1) – Wiktor Stribiżew Feb 28 '19 at 20:39
  • @Wiktor Stribiżew How to try it? using re? Can you write the full line plz. – user9371654 Feb 28 '19 at 20:42
  • Yes, `re` is enough. `print(re.search(r'http://([^/]*)/?','http://www.mymovies.com').group(1))` and `print(re.search(r'http://([^/]*)/?','http://www.mymovies.com/').group(1))`. I do not know what other types of URLs you want to match, thus, it is a suggestion. – Wiktor Stribiżew Feb 28 '19 at 20:42
  • @Wiktor Stribiżew Why small r before the string? – user9371654 Feb 28 '19 at 20:43
  • See [What exactly is a “raw string regex” and how can you use it?](https://stackoverflow.com/questions/12871066/what-exactly-is-a-raw-string-regex-and-how-can-you-use-it). In such a string literal, all ``\`` are treated as literal backslashes. [Further reading](https://stackoverflow.com/questions/28334871/why-do-python-regex-strings-sometimes-work-without-using-raw-strings) if you are intrigued. Well, consider `r` a best practice when defining regex patterns in Python. – Wiktor Stribiżew Feb 28 '19 at 20:44
  • Do you have any more test cases? Any set of rules of what kind of input the regex should match or avoid matching? – Wiktor Stribiżew Feb 28 '19 at 20:48
  • @Wiktor Stribiżew Yours is the correct answer. The error appeared because the pattern starts with `http://` while the string in my actual code starts with `https://`. To capture both, I used `http(s?)://([^/]*)/?`, since some strings have `http://` and others have `https://`. – user9371654 Feb 28 '19 at 20:57
  • See [my answer below](https://stackoverflow.com/a/54934446/3832970). – Wiktor Stribiżew Feb 28 '19 at 21:23

4 Answers

You may use

import re

m = re.search(r'https?://([^/]*)/?', 'http://www.mymovies.com')
if m:  # re.search returns None when there is no match
    print(m.group(1))

See the regex demo

Details

  • http - http substring
  • s? - 1 or 0 s chars
  • :// - a :// substring
  • ([^/]*) - Capturing group 1: zero or more chars other than /
  • /? - 1 or 0 / chars.
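
As a minimal sketch of how this addresses the `AttributeError` from the question: `re.search` returns `None` when nothing matches, so the match object should be checked before calling `.group()`. The helper name `extract_host` and the constant `HOST_RE` are hypothetical, chosen only for illustration; the pattern is the one described above.

```python
import re

# Hypothetical names; the pattern itself is the one from this answer.
HOST_RE = re.compile(r'https?://([^/]*)/?')

def extract_host(url):
    """Return the text between the scheme and the first '/', or None if no match."""
    m = HOST_RE.search(url)
    # re.search returns None when nothing matches, so check before .group()
    return m.group(1) if m else None

# The question's pattern requires a trailing '/', so it finds no match here:
print(re.search(r'http://(.*)/', 'http://www.mymovies.com'))  # None

print(extract_host('http://www.mymovies.com'))    # www.mymovies.com
print(extract_host('https://www.mymovies.com/'))  # www.mymovies.com
print(extract_host('not a url'))                  # None
```

Returning `None` instead of raising leaves it to the caller to decide how to handle input that does not start with `http://` or `https://`.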

Python demo (prints four www.mymovies.com as output):

import re

strs = ['http://www.mymovies.com/', 'http://www.mymovies.com',
        'https://www.mymovies.com/', 'https://www.mymovies.com']
r = re.compile(r'https?://([^/]*)/?')
for s in strs:
    m = r.search(s)  # search each test string in turn
    if m:
        print(m.group(1))
Wiktor Stribiżew

Your search pattern is http://(.*)/, so the / at the end is mandatory. If you put a ? after it, you make it optional; alternatively, you can leave it out completely. If you don't want it to be part of the resulting string, either restrict the matched characters before it to everything but /:

http://([^/]*)

or do a simple last-character-check after the operation and remove it if it's a /:

if result.endswith("/"): result = result[:-1]

It should also be noted that if your input can be full URLs (including paths and ?key=value pairs), you should restrict the matched characters further.
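
One possible way to restrict the match further, sketched here under an assumption: besides `/`, the character class also excludes `?` and `#`, so that a query string or fragment directly after the host is not captured. The choice of those two characters and the variable names `loose` and `strict` are illustrative, not part of the answer above.

```python
import re

# Class that only stops at '/', as discussed above
loose = re.compile(r'https?://([^/]*)')
# Assumed stricter variant: also stop at '?' (query) and '#' (fragment)
strict = re.compile(r'https?://([^/?#]+)')

url = 'http://www.mymovies.com?key=value'
print(loose.search(url).group(1))   # www.mymovies.com?key=value
print(strict.search(url).group(1))  # www.mymovies.com
```

Other parts of a full URL, such as a port or user info, would still be captured by this sketch; whether that matters depends on the input.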

Krateng

Try Regex: (?<=http:\/\/)[^\/]+?(?=\/|$)

Demo
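
A sketch of how a lookbehind/lookahead pattern of this kind might be used from Python. The slashes are written unescaped because `/` has no special meaning in Python's `re`; the variable name `pattern` is illustrative.

```python
import re

# The lookbehind and lookahead keep 'http://' and the trailing '/' out of the
# match, so the whole match (group 0) is the host and no capture group is needed.
pattern = re.compile(r'(?<=http://)[^/]+?(?=/|$)')

for url in ('http://www.mymovies.com/', 'http://www.mymovies.com'):
    m = pattern.search(url)
    if m:
        print(m.group())  # www.mymovies.com
```

Note that lookbehind in Python's `re` module must be fixed-width, so `https?` cannot be placed directly inside it; a capture group is the simpler route if both schemes need to be handled.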

Matt.G

You could do it without regular expressions using the split() method:

url.split("/")[2]

'http://www.mymovies.com/'.split("/")[2] ==> "www.mymovies.com"

'http://www.mymovies.com'.split("/")[2] ==> "www.mymovies.com"

'http://www.mymovies.com/star-wars/episodeV'.split("/")[2] ==> "www.mymovies.com"
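
The same idea as a runnable sketch. The helper name `host_by_split` and the length guard are assumptions added for illustration: without the guard, `split("/")[2]` raises `IndexError` when the input has fewer than three `/`-separated fields. For comparison only, the standard library's `urllib.parse.urlsplit` is shown alongside; it is not part of the answer above.

```python
from urllib.parse import urlsplit

def host_by_split(url):
    """Hypothetical helper: third '/'-separated field, or None if there are fewer than three."""
    parts = url.split("/")
    # 'http://host/path' -> ['http:', '', 'host', 'path']
    return parts[2] if len(parts) > 2 else None

for url in ('http://www.mymovies.com/',
            'http://www.mymovies.com',
            'http://www.mymovies.com/star-wars/episodeV'):
    # Both columns print www.mymovies.com for these three inputs
    print(host_by_split(url), urlsplit(url).netloc)
```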
Alain T.