Getting parts of a URL in JavaScript

Question

I have to match URLs in a text, linkify them, and then display only the host--domain name or IP address--to the user. How can I proceed with JavaScript?

Thanks.
PS: please don't tell me about this; those regular expressions are so buggy they can't match http://google.com

"regular expressions are so buggy they can't match"...not sure how to fix this false assumption. Are they *ideal*? No, but *correct* regular expressions aren't buggy, unless the engine is. — Nick Craver, Jan 02 '11 at 12:07
`http://google.com` is only a URL fragment; it doesn't have anything to describe what is wanted within the authority domain. (Browsers usually react to this by asking for the root resource, `/`, but that's convention only.) — Donal Fellows, Jan 02 '11 at 12:09

score 2 · Answer 1 · edited Oct 07 '21 at 06:01

~~If you don't want to use regular expressions~~, then you'll need to use things like `indexOf` instead. For instance, search for "://" in the text of every element; if you find it and the bit in front of it looks like a protocol (or "scheme"), grab it along with the following characters that are valid URI characters (RFC 2396). If the result ends in a dot or question mark, remove that character (it probably ends a sentence). There's not really a lot more to say.
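
As a rough sketch of that `indexOf` approach (the function name `findUrls` and the character tables are my own choices for illustration, and the set of "valid URI characters" is only an approximation of RFC 2396):

```javascript
// Letters, and the characters a scheme may contain after its first letter.
const LETTERS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
const SCHEME_CHARS = LETTERS + "0123456789+-.";
// Assumption: an approximate set of characters accepted after "://".
const URI_CHARS = SCHEME_CHARS + "_!~*'();/?:@&=$,%#";

// Hypothetical helper: returns every URL-looking substring of `text`.
function findUrls(text) {
  const urls = [];
  let pos = text.indexOf("://");
  while (pos !== -1) {
    // Walk backwards over characters that could belong to a scheme.
    let start = pos;
    while (start > 0 && SCHEME_CHARS.includes(text[start - 1])) start--;
    // A scheme must begin with a letter, so skip leading non-letters.
    while (start < pos && !LETTERS.includes(text[start])) start++;
    // Walk forwards over characters that could belong to the URL.
    let end = pos + 3;
    while (end < text.length && URI_CHARS.includes(text[end])) end++;
    // A trailing dot or question mark probably ends the sentence.
    while (end > pos + 3 && (text[end - 1] === "." || text[end - 1] === "?")) end--;
    // Keep the match only if it has both a scheme and something after "://".
    if (start < pos && end > pos + 3) urls.push(text.slice(start, end));
    pos = text.indexOf("://", end);
  }
  return urls;
}
```

For example, `findUrls("See http://google.com. Bye")` should give back a single entry without the sentence-ending dot. Other trailing punctuation, such as a closing parenthesis, is not handled here.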

Update: Ah, I see from your edit that you don't have a problem with regular expressions, just the ones in the answers to that question. Fair enough.

This may well be one of those places where trying to do it all with a regular expression is more work than it should be, but using regular expressions as part of the solution is helpful. For instance,

/[a-zA-Z][a-zA-Z0-9+\-.]*:\/\//

...may well be a helpful way to find the beginning of a URL, since the scheme portion must start with an alpha and then can have zero or more alpha, digit, +, -, or . prior to the : (section 3.1).
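
A minimal sketch of using that expression to locate candidate URL starts, assuming an environment with `String.prototype.matchAll` (ES2020); the helper name `findSchemeStarts` is made up for this example:

```javascript
// The scheme pattern from above, with the global flag so every match is found.
const SCHEME_START = /[a-zA-Z][a-zA-Z0-9+\-.]*:\/\//g;

// Hypothetical helper: lists where each "scheme://" begins and what the scheme is.
function findSchemeStarts(text) {
  return Array.from(text.matchAll(SCHEME_START), (m) => ({
    index: m.index,               // offset of the first scheme character
    scheme: m[0].slice(0, -3),    // the match without the trailing "://"
  }));
}
```

Each `index` is then a starting point for scanning forward over the rest of the URL by whatever rule you settle on.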

@sexyprout: Right. That's where the rest of the RFC comes in. (Hostnames are *fairly* straightforward, provided you process escaped characters correctly -- they're everything after the `://` and before the first unescaped `/` or `:`, if any; or end-of-URL-like-characters if there is no unescaped `/` or `:`.) — T.J. Crowder, Jan 02 '11 at 12:39
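
The hostname rule in that comment could be sketched roughly as follows. The name `hostOf` is hypothetical, and stopping at `?` and `#` as well is my own assumption rather than part of the comment. Escaped characters, `user@host` prefixes, and bracketed IPv6 literals are not handled:

```javascript
// Hypothetical helper: returns the host part of an already-extracted URL,
// or null if the string contains no "://".
function hostOf(url) {
  const sep = url.indexOf("://");
  if (sep === -1) return null;
  const rest = url.slice(sep + 3);
  // The host runs up to the first "/" or ":" (port); "?" and "#" are
  // included here as an extra assumption for URLs that have no path.
  const end = rest.search(/[\/:?#]/);
  return end === -1 ? rest : rest.slice(0, end);
}
```

With this, `hostOf("http://google.com")` returns the whole remainder after the scheme, since no delimiter follows the host.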
