1

I have a string containing several URLs. For example: "first string url is example.com/directory, second URL is http://example.com/directory and 3rd is www.example.com/directory"

I want my regex to match exactly "example.com/directory", without the http:// or www. prefix.

I am trying the following regex, but it also matches the http, https and www forms:

(\S+)(?:com|net|[/])[/](\S+|$)
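
To make the problem concrete, here is a minimal sketch in Python (an assumption; the question does not name a language) that runs the pattern above over the sample string. The variable names are illustrative only.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r"(\S+)(?:com|net|[/])[/](\S+|$)")

text = ("first string url is example.com/directory, "
        "second URL is http://example.com/directory "
        "and 3rd is www.example.com/directory")

# group(0) is the whole match, including any scheme or "www." prefix.
matches = [m.group(0) for m in pattern.finditer(text)]
for match in matches:
    print(match)
```

All three URLs match in full, prefix included, because the leading (\S+) accepts any run of non-space characters.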
Umar Tanveer

3 Answers

1

Avoid a regex if you can; see whether you can parse the URL with a dedicated library instead.

This will also help with other TLDs, such as .net, .org, .club.

>>> import urllib.parse
>>> urls = ("https://www.example.com/directory", "www.example.com/directory", "example.com/directory")
>>> for url in urls:
...     print(urllib.parse.urlparse("http://" + url.split("//")[-1]))
...
ParseResult(scheme='http', netloc='www.example.com', path='/directory', params='', query='', fragment='')
ParseResult(scheme='http', netloc='www.example.com', path='/directory', params='', query='', fragment='')
ParseResult(scheme='http', netloc='example.com', path='/directory', params='', query='', fragment='')

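To get just the second-level and top-level domain, you can split() the netloc on "." and keep the last two labels. This assumes a single-label suffix such as .com; for a host like example.co.uk it would return ['co', 'uk'].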

>>> urllib.parse.urlparse("http://whatever.example.com").netloc.split(".")[-2:]
['example', 'com']
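
Putting the two steps together, a minimal sketch of a helper that returns "example.com/directory" for all three input forms might look like this. The name bare_url and the rule "strip one leading www." are assumptions made for illustration, not part of urllib.

```python
from urllib.parse import urlparse


def bare_url(url):
    """Return 'host/path' without scheme or leading 'www.', or None."""
    # Keep the text after the last "//", then prepend a known scheme so
    # that urlparse puts the host in netloc instead of path.
    parsed = urlparse("http://" + url.split("//")[-1])
    # Treat the input as usable only if it has both a host and a path.
    if not parsed.netloc or not parsed.path.strip("/"):
        return None
    host = parsed.netloc
    if host.startswith("www."):
        host = host[len("www."):]
    return host + parsed.path


for url in ("https://www.example.com/directory",
            "www.example.com/directory",
            "example.com/directory"):
    print(bare_url(url))
```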
ti7
  • I understand your point, but how would I detect example.com/directory as a URL so that I can pass it into urls (the tuple)? Forms such as https://www.example.com/directory and www.example.com/directory are easy to detect. – Umar Tanveer Jul 30 '21 at 18:45
  • you can detect that it has a `.netloc` and `.path` field, take this as being valid (or at least not invalid), and _then_ split out the real top and second level domains (perhaps `.split(".")[-2:]`?) – ti7 Jul 30 '21 at 18:49
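
The check described in the comment above could be sketched roughly as follows for free text. The function name find_urls, the punctuation set that is stripped, and the "host contains a dot" heuristic are all assumptions made for illustration; they will accept some tokens that are not real URLs.

```python
from urllib.parse import urlparse


def find_urls(text):
    """Yield (netloc, path) for whitespace-separated tokens that look like URLs."""
    for token in text.split():
        # Trim punctuation that commonly surrounds a URL in prose.
        token = token.strip(".,;:!?\"'()")
        # Normalise to a known scheme so urlparse fills in netloc.
        parsed = urlparse("http://" + token.split("//")[-1])
        # Heuristic (an assumption): a dotted host plus a non-empty path.
        if "." in parsed.netloc and parsed.path.strip("/"):
            yield parsed.netloc, parsed.path


text = ("first string url is example.com/directory, "
        "second URL is http://example.com/directory "
        "and 3rd is www.example.com/directory")
found = list(find_urls(text))
print(found)
```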
0

To allow at most one prefix, either http:// or www., make it an optional group of alternatives:

^(?:http:\/\/|www\.)?(\w+\.(?:com|net)\/directory)$

Try it out here: https://regex101.com/r/aPtYhc/1

The capturing group captures only the "example.com/directory" part of the URL. The full match can therefore include the prefix, while group 1 holds just the part you want to extract.
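
As a minimal sketch of the difference between the full match and the capture group, assuming Python's re module (the sample URLs are illustrative):

```python
import re

# Same pattern as above; "/" needs no escaping in Python.
pattern = re.compile(r"^(?:http://|www\.)?(\w+\.(?:com|net)/directory)$")

for url in ("http://example.com/directory",
            "www.example.com/directory",
            "example.com/directory",
            "http://www.example.com/directory"):
    m = pattern.match(url)
    if m:
        # group(0) is everything matched; group(1) is the captured part.
        print(m.group(0), "->", m.group(1))
    else:
        print(url, "-> no match")
```

The last URL carries both prefixes, so it does not match: the pattern allows only one of them.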

zr0gravity7
  • Can we match just first one? (example.com/directory) – Umar Tanveer Jul 30 '21 at 18:19
  • Not sure what you mean. If you want a RegEx to only match "example.com/directory", then it is this one: `example\.(?:com|net)\/directory`. If you want a RegEx that can ensure that the entire string starts with one of the prefixes (`http://` or `www.`) and then extract the string like "example.com/directory" that follows, then you must use a capture group. This means that your RegEx will match more than you want to extract, but you can simply use the capture group to extract what you want. – zr0gravity7 Jul 30 '21 at 18:24
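
If "match just the first one" means finding only the bare form inside a longer string, one possible approach is a negative lookbehind. This is a different technique from the one in the answer above, and both the pattern and the variable names are assumptions for illustration, shown here with Python's re module.

```python
import re

# Reject a candidate when the character just before it is a word character,
# "." or "/", which is the case after "www." and "http://".
bare_only = re.compile(r"(?<![\w./])\w+\.(?:com|net)/\w+")

text = ("first string url is example.com/directory, "
        "second URL is http://example.com/directory "
        "and 3rd is www.example.com/directory")
result = bare_only.findall(text)
print(result)
```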
0

This regex (given after the examples) might help. Sample inputs and their matches:

1 www.example.com/directory match:example.com/directory

2 http://example.com/directory match:example.com/directory

3 example.com/directory match:example.com/directory

4 example.net/directory match:example.net/directory

(?<=www\.)[\w\.\/]+|(?<=http:\/\/)[\w\.\/]+|\w+\.com\/\w+|\w+\.net\/\w+

You can test the regex online here:

https://regex101.com/r/pEmcJP/1
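
A minimal sketch of the same idea in Python, with the literal dots escaped so that they match only a "." character. The dictionary name results is illustrative.

```python
import re

# Lookbehinds require "www." or "http://" just before the match without
# including them in it; the last two alternatives cover the bare forms.
pattern = re.compile(
    r"(?<=www\.)[\w./]+|(?<=http://)[\w./]+|\w+\.com/\w+|\w+\.net/\w+"
)

cases = ("www.example.com/directory",
         "http://example.com/directory",
         "example.com/directory",
         "example.net/directory")
results = {case: pattern.search(case).group(0) for case in cases}
for case, matched in results.items():
    print(case, "match:", matched)
```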

nknk