
I looked around for a while, but I probably can't find the right keywords to search with, so I'm here. I need to match a URL, stripping out the protocol, up to the first /.

Target: match the first substring after http:// up to the first / (the trailing / may not exist) or up to the end of the string. And here comes the problem:

I wrote this regex:

(?<=//)(.*?)(?=/)

but this regex only matches URLs that have at least one '/' after the protocol.

Here are some URLs to be matched:

  • http://www.google.com/ (matched by my regex)
  • http://www.google.com
  • https://www.google
  • xxx://www.google.com/hello/bleh blah....../
  • xxx://google.com
  • google.com/blah/hello.php?x=11_x.hi
Chris Seymour
AeonDave

4 Answers

^(?:\w+://)?([\w.-]+)/?.*$

(Double the backslashes for Java.) This seems to work on all your examples, including a plain www.google.com. The host ends up in capture group 1.
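A minimal sketch of how that pattern could be used from Java, with the backslashes doubled as mentioned. The class name `AnchoredHostDemo` and the helper `host` are made up for illustration; only `java.util.regex` is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredHostDemo {
    // The pattern from this answer, escaped for a Java string literal.
    private static final Pattern HOST =
            Pattern.compile("^(?:\\w+://)?([\\w.-]+)/?.*$");

    // Returns capture group 1 (the host part), or null if the whole input does not match.
    static String host(String url) {
        Matcher m = HOST.matcher(url);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.google.com/",
            "http://www.google.com",
            "https://www.google",
            "xxx://google.com",
            "google.com/blah/hello.php?x=11_x.hi"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + host(s));
        }
    }
}
```

Because the protocol group is optional, the last sample (no protocol at all) still yields `google.com`.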

PhiLho

Something like...

^(https?:\/\/)?(([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,6})\/

I saw this in a book I had. That should account for a variable http/https, disallow whitespace, and probably stop at the first slash.

Comment if I did this wrong.

stema
Phil Colson
  • Well, I needed to cut off http:// too. I ended up with this: ((?:[a-z][a-z\.\d\-]+)\.(?:[a-z][a-z\-]+))(?![\w\.]) Thanks for the suggestion anyway. – AeonDave Dec 06 '12 at 03:01
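For reference, a rough sketch of trying the regex from that comment in Java. It is not anchored, so `find()` is used and the first hit is taken. The class name `CommentRegexDemo` and the helper `firstHost` are hypothetical. Note the pattern only covers lowercase letters unless a case-insensitive flag is added.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentRegexDemo {
    // Regex from the comment, escaped for a Java string literal.
    private static final Pattern HOST = Pattern.compile(
            "((?:[a-z][a-z\\.\\d\\-]+)\\.(?:[a-z][a-z\\-]+))(?![\\w\\.])");

    // Returns the first substring that looks like "name.tld", or null if none is found.
    static String firstHost(String url) {
        Matcher m = HOST.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstHost("http://www.google.com/"));
        System.out.println(firstHost("https://www.google"));
        System.out.println(firstHost("google.com/blah/hello.php?x=11_x.hi"));
    }
}
```

The protocol is skipped because "http" contains no dot, so the first possible match starts at the host.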

This works for all your examples but the last:

(?<=//)[^/\\s]+

[^/\\s] is a negated character class matching every character except / and \s (whitespace, e.g. space, tab, or newline characters).

See it here on Regexr

What will not work is the last row. How do you want to decide what is a link? If I make the first part optional, it will match any run of characters other than / and whitespace.
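A small sketch of the lookbehind version in Java, to show the behaviour described above, including the missing match on the last row. `LookbehindHostDemo` and `host` are illustrative names, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindHostDemo {
    // (?<=//) requires "//" right before the match; [^/\s]+ then runs
    // until the next slash, whitespace, or the end of the input.
    private static final Pattern HOST = Pattern.compile("(?<=//)[^/\\s]+");

    // Returns the first match, or null when the input contains no "//".
    static String host(String url) {
        Matcher m = HOST.matcher(url);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(host("http://www.google.com"));
        System.out.println(host("xxx://www.google.com/hello/bleh blah....../"));
        // No "//" in this one, so nothing is matched and null is printed.
        System.out.println(host("google.com/blah/hello.php?x=11_x.hi"));
    }
}
```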

stema

It seems like you have the right answer, but you're missing the possibility of not having a trailing "/". Try this:

(?<=//)(.*?)(?=/|$)
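A quick sketch of this variant in Java, assuming `find()` is used and the first hit is what you want. The names `LazyHostDemo` and `host` are invented for the example. The only change from the question's regex is the `|$` in the lookahead, which lets the lazy group stop at the end of the input when there is no further slash.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyHostDemo {
    // Lazy group after "//", ending just before the next "/" or at the end of input.
    private static final Pattern HOST = Pattern.compile("(?<=//)(.*?)(?=/|$)");

    // Returns capture group 1 of the first match, or null when there is no "//".
    static String host(String url) {
        Matcher m = HOST.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(host("http://www.google.com/"));
        System.out.println(host("http://www.google.com"));
        System.out.println(host("xxx://google.com"));
    }
}
```

Like the question's original regex, this still needs a `//` in the input, so a bare `google.com/blah/...` is not matched.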
Daedalus