1

I'm currently having some issues with a regex to extract a URL.

I want my regex to accept URLs such as:

http://stackoverflow.com/questions/ask
https://stackoverflow.com
http://local:1000
https://local:1000

From some tutorials, I've learned that this regex matches all of the above: ^(http|https)\://.*$. However, it also accepts http://local:1000;http://invalid http://khttp:// as a single string, when it shouldn't accept it at all.

I understand that my expression isn't written to exclude this, but I can't work out how to write it so that it checks for this scenario.

Any help is greatly appreciated!

Edit:

Looking at my issue again, it seems I could solve it by checking that '//' doesn't occur in the string after the initial http:// or https://. Any ideas on how to implement that?

Sorry, this will be done in Java.

I also need to add the following constraint: a string such as http://local:80/test:90 should fail because of the second port-like segment. In other words, a valid string may contain only two : symbols in total (one after http/https and one before the port).

JDB
user2019260
  • You want to extract the url without the protocol? – ichigolas Jan 28 '13 at 18:58
  • Hi, if the string contains multiple urls such as http://http://k.http://blah it shouldn't be found as valid in my regex – user2019260 Jan 28 '13 at 19:10
  • Yes, as long as it's not another URL it's fine. – user2019260 Jan 28 '13 at 19:19
  • Looking at my issue, it seems that I could eliminate my issue as long as I can implement a check to make sure '//' doesn't occur in my string after the initial http:// or https://, any ideas on how to implement? – user2019260 Jan 28 '13 at 19:23
  • Please read the [regex] tag's description: "Please also include a tag specifying the programming language or tool you are using." – JDB Jan 28 '13 at 19:25

4 Answers

1

Check whether your programming language already has a URL parser. For example, PHP has parse_url().
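Since the question targets Java, one option along these lines is the standard java.net.URI class. The following is only a sketch: the class name UrlParserCheck and the method isPlainHttpUrl are hypothetical, and the extra path checks (no ':' and no '//' after the host and port) are assumptions taken from the constraints stated in the question, not something URI enforces by itself.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UrlParserCheck {
    // Hypothetical helper: accept only http/https URLs that have a host
    // and whose path contains no further ':' or '//'.
    static boolean isPlainHttpUrl(String candidate) {
        try {
            // Throws URISyntaxException on illegal characters such as spaces.
            URI uri = new URI(candidate);
            String scheme = uri.getScheme();
            if (scheme == null
                    || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
                return false;
            }
            if (uri.getHost() == null) {
                return false; // the authority did not parse as host[:port]
            }
            String path = uri.getRawPath();
            // Assumption from the question: reject a second "port" or an embedded URL.
            return path == null || !(path.contains(":") || path.contains("//"));
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://stackoverflow.com/questions/ask",
            "https://local:1000",
            "http://local:80/test:90",
            "http://local:1000;http://invalid http://khttp://"
        };
        for (String s : samples) {
            System.out.println(isPlainHttpUrl(s) + "  " + s);
        }
    }
}
```

This only inspects the path, so a colon inside a query string would still pass; tighten it if the "two colons in total" rule must hold for the whole string.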

Greg
1

This will only produce a match if there is no :// after its first appearance in the string.

^https?:\/\/(?!.*:\/\/)\S+

Note that extracting a valid URL from within a string is very complex (see "In search of the perfect URL validation regex"), so the above does not attempt to do that.
It just matches the protocol and the non-space characters that follow it.

In Java

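import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Backslashes are doubled because the regex is written as a Java string literal.
Pattern reg = Pattern.compile("^https?:\\/\\/(?!.*:\\/\\/)\\S+");
Matcher m = reg.matcher("http://somesite.com");
if (m.find()) {
    System.out.println(m.group()); // the matched URL
} else {
    System.out.println("No match");
}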
MikeM
  • This seems like what I need. Any idea how to do this in Java? – user2019260 Jan 28 '13 at 19:45
  • @Greg. Yes, that's great, but it assumes that you have already got the url. – MikeM Jan 28 '13 at 19:55
  • Mike, thank you, this works great. One question: if I wanted to add the constraint that a second colon in the string also makes it invalid (e.g. "https://local:800/test:5"), how would I go about doing that? – user2019260 Jan 28 '13 at 21:21
  • @user2019260. If you mean a _third_ colon, you could use `^https?:\\/\\/(?!.*:(.*:|\\/\\/))\\S+` This will disallow `://` or two `:` in the string after `http://`. – MikeM Jan 28 '13 at 22:36
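The stricter pattern from the last comment can be tried against the strings from the question with a small Java sketch like the one below. The class and method names (StrictUrlPattern, isValid) are made up for illustration. The forward slashes are left unescaped here, since '/' has no special meaning in a Java regex, and matches() is used instead of find() so that the whole string has to match.

```java
import java.util.regex.Pattern;

final class StrictUrlPattern {
    // After the leading http:// or https://, the negative lookahead rejects
    // any later "://" and any second ':' in the remainder of the string.
    static final Pattern STRICT =
            Pattern.compile("^https?://(?!.*:(.*:|//))\\S+");

    // Hypothetical helper: true only if the entire string satisfies the pattern.
    static boolean isValid(String candidate) {
        return STRICT.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://stackoverflow.com/questions/ask",
            "https://stackoverflow.com",
            "http://local:1000",
            "https://local:800/test:5",
            "http://local:1000;http://invalid http://khttp://"
        };
        for (String s : samples) {
            System.out.println(isValid(s) + "  " + s);
        }
    }
}
```

Note that this still only checks the overall shape of the string; it does not verify that the host or port are themselves well formed.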
0

From http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

This may change based on the programming language/tool
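As a rough illustration of how this would carry over to Java, here is a sketch with the surrounding / delimiters dropped and the backslashes doubled for a string literal. The class and method names are hypothetical. Trying it on the question's samples suggests that the pattern expects a dotted host name and has no provision for a port, so the http://local:1000 style URLs from the question are not matched.

```java
import java.util.regex.Pattern;

final class TutorialUrlPattern {
    // The pattern quoted above, rewritten as a Java string literal.
    static final Pattern TUTORIAL = Pattern.compile(
            "^(https?:\\/\\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\\/\\w \\.-]*)*\\/?$");

    // Hypothetical helper: true if the whole string matches the pattern.
    static boolean isValid(String candidate) {
        return TUTORIAL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://stackoverflow.com/questions/ask",
            "https://stackoverflow.com",
            "http://local:1000"
        };
        for (String s : samples) {
            System.out.println(isValid(s) + "  " + s);
        }
    }
}
```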

JDB
0
/[A-Za-z]+:\/\/[A-Za-z0-9-]+.[A-Za-z0-9-:%&;?#/.=]+/g
Suraj Rao