How can I match my regex for URL (example.net/directory) without HTTP, HTTPS and WWW?

Question

I have a string containing URLs. For example: "first string url is example.com/directory, second URL is http://example.com/directory and 3rd is www.example.com/directory"

I want my regex to match exactly "example.com/directory", without the http or www prefix.

I am trying the following regex, but it also matches the http, https and www parts.

(\S+)(?:com|net|[/])[/](\S+|$)

https://regexr.com/ is a super useful resource to design and test regexes – Ricardo, Jul 30 '21 at 18:10
Yes, I am using the same tool, but I am unable to create a regex for a URL without http and www: https://regexr.com/62s2o – Umar Tanveer, Jul 30 '21 at 18:12

ti7 · Accepted Answer · 2021-07-30T18:51:47.663

1

Don't use a regex if you can avoid it; see whether you can parse the URL with a dedicated library.

This will also help with other TLDs, such as .net, .org, .club.

>>> import urllib.parse
>>> urls = ("https://www.example.com/directory", "www.example.com/directory", "example.com/directory")
>>> for url in urls:
...     print(urllib.parse.urlparse("http://" + url.split("//")[-1]))
...
ParseResult(scheme='http', netloc='www.example.com', path='/directory', params='', query='', fragment='')
ParseResult(scheme='http', netloc='www.example.com', path='/directory', params='', query='', fragment='')
ParseResult(scheme='http', netloc='example.com', path='/directory', params='', query='', fragment='')

To get just the top-level and second-level domain, you could split() the netloc:

>>> urllib.parse.urlparse("http://whatever.example.com").netloc.split(".")[-2:]
['example', 'com']
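Putting the two steps together, here is a minimal sketch that turns any of the three input forms into "example.com/directory". The helper name strip_scheme_and_www is hypothetical, and keeping only the last two host labels is a naive assumption: it gives the wrong result for suffixes such as .co.uk, and the split("//") trick misbehaves if the path itself contains "//".

```python
from urllib.parse import urlparse


def strip_scheme_and_www(url):
    """Return 'domain.tld/path' for a URL that may lack a scheme (hypothetical helper)."""
    # Drop any existing scheme, then prepend one so urlparse fills in netloc.
    parsed = urlparse("http://" + url.split("//")[-1])
    # Keep only the last two host labels (naive: wrong for suffixes like .co.uk).
    domain = ".".join(parsed.netloc.split(".")[-2:])
    return domain + parsed.path


for url in ("https://www.example.com/directory",
            "www.example.com/directory",
            "example.com/directory"):
    print(strip_scheme_and_www(url))  # prints example.com/directory each time
```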

edited Jul 30 '21 at 18:51

answered Jul 30 '21 at 18:23

ti7


I understand your point, but how would I detect example.com/directory as a URL and pass it to urls (the tuple)? https://www.example.com/directory and www.example.com/directory are easy to detect. – Umar Tanveer Jul 30 '21 at 18:45
you can detect that it has a `.netloc` and `.path` field, take this as being valid (or at least not invalid), and _then_ split out the real top and second level domains (perhaps `.split(".")[-2:]`?) – ti7 Jul 30 '21 at 18:49
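One way to read the suggestion in the comment above is sketched here: split the text on whitespace, parse each token, and keep the ones that have a dotted host and a non-empty path. The function name find_url_like and the punctuation-stripping rule are assumptions made for illustration, not something from the thread.

```python
from urllib.parse import urlparse


def find_url_like(text):
    """Return tokens that parse with a dotted netloc and a non-empty path."""
    found = []
    for token in text.split():
        token = token.strip(",.;\"'")  # trim punctuation around the token
        parsed = urlparse("http://" + token.split("//")[-1])
        # Treat the token as URL-like only if both fields look plausible.
        if "." in parsed.netloc and parsed.path not in ("", "/"):
            found.append(token)
    return found


text = ("first string url is example.com/directory, second URL is "
        "http://example.com/directory and 3rd is www.example.com/directory")
print(find_url_like(text))
```

Each token found this way can then be reduced to its domain and path with `.netloc.split(".")[-2:]` as described in the answer.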

score 0 · Answer 2 · answered Jul 30 '21 at 18:13


To allow at most one of the prefixes http:// or www. (or none at all), you can use an optional group of alternatives:

^(?:http:\/\/|www\.)?(\w+\.(?:com|net)\/directory)$

Try it out here: https://regex101.com/r/aPtYhc/1

We use a capturing group to capture only the "example.com/directory" part of the URL. This means the regex matches more than it captures: the optional prefix is part of the match but not part of the group.
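As a rough illustration in Python, the pattern from this answer can be applied with the re module and the first group read back. The function name extract is hypothetical, and the `\/` escapes are dropped because Python's re does not need them.

```python
import re

# Same pattern as in the answer, without the optional `\/` escapes.
PATTERN = re.compile(r"^(?:http://|www\.)?(\w+\.(?:com|net)/directory)$")


def extract(candidate):
    """Return the captured 'domain.tld/directory' part, or None if no match."""
    m = PATTERN.match(candidate)
    return m.group(1) if m else None


for candidate in ("example.com/directory",
                  "http://example.com/directory",
                  "www.example.com/directory",
                  "https://example.com/directory"):
    print(candidate, "->", extract(candidate))
```

The first three inputs yield example.com/directory; the https:// input yields None, because the pattern only lists http:// and www. as allowed prefixes.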


zr0gravity7


Can we match just first one? (example.com/directory) – Umar Tanveer Jul 30 '21 at 18:19
Not sure what you mean. If you want a RegEx to only match "example.com/directory", then it is this one: `example\.(?:com|net)\/directory`. If you want a RegEx that can ensure that the entire string starts with one of the prefixes (`http://` or `www.`) and then extract the string like "example.com/directory" that follows, then you must use a capture group. This means that your RegEx will match more than you want to extract, but you can simply use the capture group to extract what you want. – zr0gravity7 Jul 30 '21 at 18:24

nknk · Answer 3 · 2021-07-30T18:47:40.360

0

This regex might help. It gives the following matches:

1. www.example.com/directory, match: example.com/directory

2. http://example.com/directory, match: example.com/directory

3. example.com/directory, match: example.com/directory

4. example.net/directory, match: example.net/directory

(?<=www\.)[\w\.\/]+|(?<=http:\/\/)[\w\.\/]+|\w+\.com\/\w+|\w+\.net\/\w+

You can test the regex online here:

https://regex101.com/r/pEmcJP/1
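For a quick check in Python, here is a sketch that runs the lookbehind approach from this answer over the sentence from the question with re.findall. The pattern is written with every literal dot escaped and without the `\/` escapes, which Python does not require; the variable names are illustrative.

```python
import re

PATTERN = re.compile(
    r"(?<=www\.)[\w./]+"      # text that follows "www."
    r"|(?<=http://)[\w./]+"   # text that follows "http://"
    r"|\w+\.com/\w+"          # bare domain ending in .com
    r"|\w+\.net/\w+"          # bare domain ending in .net
)

text = ("first string url is example.com/directory, second URL is "
        "http://example.com/directory and 3rd is www.example.com/directory")

# Each of the three URLs contributes one match without its prefix.
print(PATTERN.findall(text))
# ['example.com/directory', 'example.com/directory', 'example.com/directory']
```

Python only accepts fixed-width lookbehinds, which is why each prefix gets its own alternative rather than a single `(?<=www\.|http://)` group.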

edited Jul 30 '21 at 18:47

answered Jul 30 '21 at 18:17

nknk


