15

I have this regular expression:

 re.compile(r"((https?):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)", re.MULTILINE|re.UNICODE)

But it doesn't match hashbangs (#!). What do I need to change to get that working? I know I can add ! to the character class alongside #@%, etc., but then it would also match the trailing exclamation marks in something like

Check this out: http://example.com/something/!!!

And I want to avoid that.

Peter Mortensen
ThomK
  • 2
    How about checking out the RFC for URI syntax (http://www.ietf.org/rfc/rfc3986.txt)? It will show you that the bang can only be used in certain ways otherwise it has to be escaped. Good question. – Ray Toal Jul 16 '11 at 16:20
  • 1
    I hope you're not trying to use this regex to match URLs requested by a browser: if so, you should realise that the part after the hash is *not* sent in a normal client request. – Daniel Roseman Jul 16 '11 at 17:31
  • No. I'm parsing user input and making links shorter and safer for users (we have full control; we can block a link, a domain, etc.). And with the original regex there was http://ourshortdomain.foo/urlhash/#!/twitter/something ;) – ThomK Jul 17 '11 at 19:12
  • The canonical question is *[How can I split a URL string up into separate parts in Python?](https://stackoverflow.com/questions/449775/)* (2009). – Peter Mortensen Nov 28 '22 at 02:33

7 Answers

21

Don't try to make your own regular expression for matching URLs. Use someone else's who has already solved such problems, like this one.

kindall
  • 43
    While there's nothing wrong with using somebody else's code, there's nothing wrong with writing your own either! :) I think if everybody followed the recommendation _"Don't try to make your own, use someone else's"_ we would all still be living in caves! ;) – mac Jul 16 '11 at 16:41
  • 3
    @mac - If everyone had to reinvent everything, we'd make progress much more slowly. Far better to use someone else's completed idea and then make it better by improving it or adding something new to it. Even Newton acknowledged that he was building on the foundation of others' work. – unpythonic Jul 16 '11 at 17:29
  • 1
    @Mark - I'm certainly not arguing with that, and I never said that _everybody_ should reinvent the wheel! :) I just hold that there is no hard rule to follow: sometimes it makes sense to use others' work, sometimes it doesn't. – mac Jul 16 '11 at 17:36
  • 1
    @mac - You're absolutely right. However, we should gently nudge those who write horrific regular expressions into copying others' work until they gain enough knowledge not to leave a nightmare for others to maintain. :) :) – unpythonic Jul 16 '11 at 17:43
  • The method in this link doesn't match some valid URLs, specifically those from URL shorteners. I'd give an example, but SO doesn't let me post shortened URLs. Specifically, it doesn't work with Twitter's shortener, `'https://' + 't.co' + '/blah'`. – dboshardy Jul 05 '17 at 20:34
  • The updated one in Gruber's gist works with the t.co URLs. I'll update the link. – kindall Jul 06 '17 at 00:00
  • @kindall It turns out I made a goof. PyCharm for whatever reason, when you let it fix lines that are too long, and your line is a triple-quoted string, it inserts a space at the end of your string... – dboshardy Jul 10 '17 at 14:06
  • 4
    The regex in the link is terrible: it attempts to list year 2011 known Top Level Domains and becomes VERY quickly OBSOLETE. – Cœur Mar 21 '18 at 10:25
8

It could be very long, but in practice mine works pretty well. Please try this one:

((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*

It matches all of the examples below:

http://wwww.stackoverflow.com
abc.com
http://test.test-75.1474.stackoverflow.com/
stackoverflow.com/
stackoverflow.com
rfordyce@broadviewnet.com
http://www.example.com/etcetc
www.example.com/etcetc
example.com/etcetc
user:pass@example.com/etcetc
(www.itmag.com)
example.com/etcetc?query=aasd
example.com/etcetc?query=aasd&dest=asds
http://stackoverflow.com/questions/6427530/regular-expression-pattern-to-match-url-with
www/Christina.V.Scott@gmail.com
line.lundvoll.nilsen@telemed.no.
s.hossain@unsw.edu.au
s.hossain@unsw.edu.au
Asad
  • 2
    I tried your regex with my sample text `i opened https://google.com and http://speedtest.net and www.standford.edu` but I don't get proper result. This is how I get `[('https://', 'https', 'm', ''), ('http://', 'http', 't', ''), ('', '', 'u', '')]` – mockash Feb 07 '21 at 13:56
  • It depends on where you are trying it: Python (where you don't need the backslash \ characters), Java, or something else. Please try it out here: https://regexr.com/ – Asad Feb 08 '21 at 13:19
  • 1
    Unfortunately this approach matches with some unexpected strings like ```matches if you have.any.point that not necessarily is.a.site```, you can paste on pythex.org to see – luisvenezian Mar 23 '21 at 19:08
  • This doesn't recognize strings like `https://google/` which might be used as valid URLs. Your regex requires a `.com` or `.net` at the end. – CaptAngryEyes Apr 17 '21 at 16:36
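
The tuples reported in the comment above look like the documented behaviour of `re.findall` when a pattern contains capturing groups: it returns the groups rather than the whole match. A minimal sketch, using the answer's pattern unchanged and the commenter's sample text, shows one way to get the full matched text with `finditer` and `group(0)`. The variable names are only illustrative.

```python
import re

# The pattern from the answer above, unchanged.
PATTERN = re.compile(
    r"((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}"
    r"([a-zA-Z0-9\.\&\/\?\:@\-_=#])*"
)

# Sample text taken from the comment above.
text = "i opened https://google.com and http://speedtest.net and www.standford.edu"

# With capturing groups in the pattern, findall() returns one tuple of
# group contents per match, not the matched text itself.
groups = PATTERN.findall(text)
print(groups[0])  # ('https://', 'https', 'm', '')

# finditer() yields match objects; group(0) is the whole match.
urls = [m.group(0) for m in PATTERN.finditer(text)]
print(urls)  # ['https://google.com', 'http://speedtest.net', 'www.standford.edu']
```

Rewriting the groups as non-capturing `(?:...)` would be another way to make `findall` return whole matches.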
6

This is a common problem. Use the standard library.

For Python, use urlparse (the urllib.parse module in Python 3).
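
As a rough illustration, assuming Python 3 (where the function lives in `urllib.parse`), here is how `urlparse` splits the hashbang URL from the question's comments. Note that it parses a single URL string; it does not find URLs inside free text, so it would complement a search step rather than replace it.

```python
from urllib.parse import urlparse

# The hashbang URL mentioned in the question's comments.
parts = urlparse("http://ourshortdomain.foo/urlhash/#!/twitter/something")

print(parts.scheme)    # http
print(parts.netloc)    # ourshortdomain.foo
print(parts.path)      # /urlhash/
print(parts.fragment)  # !/twitter/something
```

The hashbang ends up in the fragment, so a caller can inspect or validate it separately from the host and path.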

estani
  • urlparse would still parse OP's problem URL: urlparse.urlparse('http://example.com/something/!!!') – hoju Jan 09 '14 at 20:56
  • Well, that's a valid URL, so first of all use a URL parser to get the info. Then you can decide what to do with it. I doubt a semantic parser is really what he wants; far simpler is to try the URL out. If it doesn't work, strip the last characters and try again... – estani Jan 31 '14 at 14:31
2

Based on this link, we can use the library validators.

For example:

import validators

# validators.url() returns True for a valid URL and a falsy failure object otherwise
valid = validators.url('https://codespeedy.com/')
if valid:
    print("URL is valid")
else:
    print("Invalid URL")
Alireza Mazochi
2

I use this to search for all HTTP and HTTPS URLs. It works like a charm.

URL_PATTERN = r"https?://\S+"
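
A pattern this permissive also swallows trailing punctuation, which is exactly the `!!!` case from the question. One possible workaround, sketched here with a `https?://\S+` pattern in the spirit of the one above, is to strip trailing punctuation after matching. The helper name `find_urls` and the set of stripped characters are assumptions for illustration, not part of the answer.

```python
import re

# Permissive pattern: scheme, "://", then everything up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")

# Characters treated as sentence punctuation when they end a match.
# This set is a heuristic chosen for the example.
TRAILING_PUNCTUATION = "!.,;:?)"


def find_urls(text):
    """Return URLs found in text, with trailing punctuation removed."""
    return [m.group(0).rstrip(TRAILING_PUNCTUATION)
            for m in URL_PATTERN.finditer(text)]


print(find_urls("Check this out: http://example.com/something/!!!"))
# ['http://example.com/something/']
print(find_urls("See http://ourshortdomain.foo/urlhash/#!/twitter/something."))
# ['http://ourshortdomain.foo/urlhash/#!/twitter/something']
```

The hashbang in the middle of the URL survives because only the end of each match is trimmed. The trade-off is that a URL legitimately ending in one of these characters, such as a closing parenthesis, would be shortened too.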
Rafey Rana
1

I'll admit that I'm a little bit worried about an application that requires a regex like that to match URLs. That said, this seems to work for me:

((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)
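
A quick way to check the difference is to run the question's pattern and the modified one against the two URLs discussed in the question. This is only a sketch; the names `ORIGINAL`, `WITH_HASHBANG`, and `first_match` are made up for the comparison.

```python
import re

FLAGS = re.MULTILINE | re.UNICODE

# The pattern from the question.
ORIGINAL = re.compile(
    r"((https?):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)", FLAGS)

# The pattern from this answer: each character may be followed by "#!".
WITH_HASHBANG = re.compile(
    r"((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)", FLAGS)


def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


hashbang = "http://ourshortdomain.foo/urlhash/#!/twitter/something"
shouting = "Check this out: http://example.com/something/!!!"

print(first_match(ORIGINAL, hashbang))
# http://ourshortdomain.foo/urlhash/#
print(first_match(WITH_HASHBANG, hashbang))
# http://ourshortdomain.foo/urlhash/#!/twitter/something
print(first_match(WITH_HASHBANG, shouting))
# http://example.com/something/
```

The original pattern stops at the `!`, while the modified one accepts `!` only as part of `#!`, so the trailing `!!!` is still excluded.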
tsm
1

This is the most complete pattern I use:

URL_PATTERN = r'[A-Za-z0-9]+://[A-Za-z0-9%-_]+(/[A-Za-z0-9%-_])*(#|\\?)[A-Za-z0-9%-_&=]*'