2

I have to match URLs in a text, linkify them, and then display only the host--domain name or IP address--to the user. How can I proceed with JavaScript?

Thanks.
PS: please don't tell me about this; those regular expressions are so buggy they can't match http://google.com

seriousdev
  • 5
    "regular expressions are so buggy they can't match"...not sure how to fix this false assumption. Are they *ideal*? No, but *correct* regular expressions aren't buggy, unless the engine is. – Nick Craver Jan 02 '11 at 12:07
  • `http://google.com` is only a URL fragment; it doesn't have anything to describe what is wanted within the authority domain. (Browsers usually react to this by asking for the root resource, `/`, but that's convention only.) – Donal Fellows Jan 02 '11 at 12:09
  • I meant "the regular expressions given in that page." – seriousdev Jan 02 '11 at 12:13
  • You’re only looking for absolute URLs, right? – Gumbo Jan 02 '11 at 12:26

1 Answer

2

If you don't want to use regular expressions, you'll need to use things like `indexOf` instead. For instance, search for "://" in the text of every element; if you find it and the bit in front of it looks like a protocol (or "scheme"), grab it along with the following characters that are valid URI characters (RFC 2396). If the result ends in a dot or question mark, remove that character (it probably ends the sentence). There's not really a lot more to say.
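A rough sketch of that `indexOf` scan, in case it helps. The function name `findUrls` and the two character tables are my own assumptions, not anything standard; the tables are an approximation of the RFC 2396 character sets (with `%` and `#` added so escapes and fragments survive), and only single-character tests use a regex.

```javascript
// Approximation of the characters RFC 2396 allows in a URI, plus "%" and "#".
// Assumption: this is good enough for scanning prose, not a validator.
const URI_CHARS = /[A-Za-z0-9;\/?:@&=+$,\-_.!~*'()%#]/;
// Characters allowed in a scheme after its first letter.
const SCHEME_CHARS = /[A-Za-z0-9+\-.]/;

// Hypothetical helper: returns every URL-looking substring of `text`.
function findUrls(text) {
  const urls = [];
  let from = 0;
  for (;;) {
    const sep = text.indexOf("://", from);
    if (sep === -1) break;

    // Walk backwards over anything that could be part of a scheme.
    let start = sep;
    while (start > from && SCHEME_CHARS.test(text[start - 1])) start--;
    // A scheme has to begin with a letter, so skip leading digits or symbols.
    while (start < sep && !/[A-Za-z]/.test(text[start])) start++;

    // Walk forwards over valid URI characters.
    let end = sep + 3;
    while (end < text.length && URI_CHARS.test(text[end])) end++;
    // A trailing dot or question mark most likely ends the sentence.
    while (end > sep + 3 && (text[end - 1] === "." || text[end - 1] === "?")) end--;

    // Keep it only if there is a scheme and something after "://".
    if (start < sep && end > sep + 3) urls.push(text.slice(start, end));
    from = Math.max(end, sep + 3);
  }
  return urls;
}

console.log(findUrls("See http://google.com. Or https://example.org/a?b=1?"));
```

For that sample sentence it yields `http://google.com` and `https://example.org/a?b=1`, with the sentence-ending punctuation dropped.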

Update: Ah, I see from your edit that you don't have a problem with regular expressions, just the ones in the answers to that question. Fair enough.

This may well be one of those places where trying to do it all with a regular expression is more work than it should be, but using regular expressions as part of the solution is helpful. For instance,

/[a-zA-Z][a-zA-Z0-9+\-.]*:\/\//

...may well be a helpful way to find the beginning of a URL, since the scheme portion must start with an alpha and then can have zero or more alpha, digit, +, -, or . prior to the : (RFC 2396, section 3.1).
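To get from there to the host, one option is sketched below. It extends the scheme pattern with a run of URI characters (that tail is my assumption, an approximation of RFC 2396), trims sentence-ending punctuation, and then hands each candidate to the built-in `URL` constructor to read `hostname`. That last step swaps in the runtime's URL parser in place of hand-parsing per the RFC, and assumes a reasonably modern browser or Node. The name `findUrlHosts` is hypothetical.

```javascript
// Scheme pattern from above, followed by one or more URI-looking characters.
// Assumption: the character class approximates RFC 2396, plus "%" and "#".
const URL_PATTERN =
  /[a-zA-Z][a-zA-Z0-9+\-.]*:\/\/[A-Za-z0-9;\/?:@&=+$,\-_.!~*'()%#]+/g;

// Hypothetical helper: returns { url, host, index } for each URL found in `text`.
function findUrlHosts(text) {
  const results = [];
  for (const match of text.matchAll(URL_PATTERN)) {
    // Drop a trailing "." or "?" that probably belongs to the sentence.
    const url = match[0].replace(/[.?]+$/, "");
    let host;
    try {
      // The URL constructor throws on input it cannot parse.
      host = new URL(url).hostname;
    } catch {
      continue;
    }
    if (host) results.push({ url, host, index: match.index });
  }
  return results;
}

for (const { url, host } of findUrlHosts(
  "Go to http://google.com. Or http://192.168.0.1:8080/x?"
)) {
  console.log(host, "<-", url);
}
```

The `index` in each result is where the match starts in the original text, so the caller can replace that span with a link whose visible text is just `host`. Escaping the URL for HTML is left to whatever builds the link.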

T.J. Crowder
  • But then I've got to extract the hostname. – seriousdev Jan 02 '11 at 12:30
  • @sexyprout: Right. That's where the rest of the RFC comes in. (Hostnames are *fairly* straightforward, provided you process escaped characters correctly -- they're everything after the `://` and before the first unescaped `/` or `:`, if any; or end-of-URL-like-characters if there is no unescaped `/` or `:`.) – T.J. Crowder Jan 02 '11 at 12:39