
I'm trying to match a URL in a string of text and I'm using this regex to search for a URL :

/\b(https?:\/\/.*?\.[a-z]{2,4}\b)/g

The problem is, it only ever matches the protocol and domain, and nothing else that follows.

Example :

let regEx = /\b(https?:\/\/.*?\.[a-z]{2,4}\b)/g;
let str = 'some text https://website.com/sH6Sd2x some more text';

console.log(str.match(regEx));

Returns :

https://website.com

How would I alter the regex so it will return the full URL?

https://website.com/sH6Sd2x

spice
  • Your regexp ends with `\.[a-z]{2,4}\b`, so it will only match up to the top-level domain part of the URL. – Barmar Nov 25 '18 at 21:05
  • @Barmar, yes thanks, I'm aware of that. My question was how to alter the regex to include the rest? – spice Nov 25 '18 at 21:07
  • A usual URL extraction pattern assumes there is no whitespace after the protocol. Try just `/\bhttps?:\/\/\S+\b/g`, see [demo](https://regex101.com/r/btltNG/1) – Wiktor Stribiżew Nov 25 '18 at 21:07
  • @WiktorStribiżew yep that's it, thank you very much :) – spice Nov 25 '18 at 21:08

2 Answers


The reason it stops there is that your expression ends with \.[a-z]{2,4}, which I guess is intended to match the top-level domain (.com, .net, .uk, etc.). After that it stops matching.

The solution: add \/[^\s]* to the expression. This matches a further slash and zero or more non-whitespace characters.

Note that \S (with capital S) is equivalent to [^\s] (with lowercase s), so use what you like best.

Demo:

let regEx = /\b(https?:\/\/.*?\.[a-z]{2,4}\/[^\s]*\b)/g;
let str = 'some text https://website.com/sH6Sd2x some more text';

console.log(str.match(regEx));
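One caveat with this version: the added \/ is mandatory, so a URL with nothing after the domain is not matched at all. Below is a minimal sketch of one way around that, assuming the slash-and-path part should be optional. The names withRequiredPath and withOptionalPath and the first test string are made up for illustration.

```javascript
// Regex from the demo above: the slash after the TLD is required.
const withRequiredPath = /\b(https?:\/\/.*?\.[a-z]{2,4}\/[^\s]*\b)/g;
// Hypothetical variant: the "/..." part sits in an optional non-capturing group.
const withOptionalPath = /\bhttps?:\/\/.*?\.[a-z]{2,4}\b(?:\/\S*)?/g;

const bare = 'visit https://website.com now';
const full = 'some text https://website.com/sH6Sd2x some more text';

console.log(bare.match(withRequiredPath)); // null
console.log(bare.match(withOptionalPath)); // [ 'https://website.com' ]
console.log(full.match(withOptionalPath)); // [ 'https://website.com/sH6Sd2x' ]
```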

You might even shorten it further if you realize that URLs never contain whitespace and that matching the domain explicitly is not needed. Worse, it may even cause trouble (e.g. .museum is also a valid TLD, but the {2,4} limit excludes it).

Enhanced version (shorter regex and more accurate):

let regEx = /\b(https?:\/\/\S*\b)/g;
let str = 'some text https://website.com/sH6Sd2x some more text';

console.log(str.match(regEx));
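A rough comparison of the two patterns, as a sketch on made-up inputs (example.museum and the two punctuation strings are assumptions chosen for illustration, as are the variable names). It also shows a side effect of the trailing \b: it drops a final full stop from the match, but it drops a final slash as well.

```javascript
// First demo's pattern: explicit TLD of 2 to 4 letters, then a required path.
const explicitTld = /\b(https?:\/\/.*?\.[a-z]{2,4}\/[^\s]*\b)/g;
// Enhanced pattern: anything non-whitespace after the protocol.
const anyNonSpace = /\b(https?:\/\/\S*\b)/g;

const museum = 'go to https://example.museum/x today';
console.log(museum.match(explicitTld)); // null
console.log(museum.match(anyNonSpace)); // [ 'https://example.museum/x' ]

// The trailing \b forces the match to end on a word character.
const endOfSentence = 'Read https://website.com/sH6Sd2x.';
const trailingSlash = 'Open https://website.com/docs/ first';
console.log(endOfSentence.match(anyNonSpace)); // [ 'https://website.com/sH6Sd2x' ]
console.log(trailingSlash.match(anyNonSpace)); // [ 'https://website.com/docs' ]
```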
Peter B

Since the regexp ends with \.[a-z]{2,4}\b, it only matches up to the top-level domain part of the hostname in the URL. You need to match the rest of the URL after that. This matches any non-whitespace characters after that:

let regEx = /\bhttps?:\/\/.*?\.[a-z]{2,4}\b\S*/g;
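A quick check of this pattern against the string from the question, plus a made-up sentence containing two URLs (the second input and the name two are assumptions for illustration):

```javascript
const regEx = /\bhttps?:\/\/.*?\.[a-z]{2,4}\b\S*/g;

// String from the question.
const str = 'some text https://website.com/sH6Sd2x some more text';
console.log(str.match(regEx)); // [ 'https://website.com/sH6Sd2x' ]

// With the g flag, match() returns every URL found.
const two = 'see http://a.org/x and https://b.net/y?z=1 for details';
console.log(two.match(regEx)); // [ 'http://a.org/x', 'https://b.net/y?z=1' ]
```

Since this pattern ends with \S* and has no trailing \b, punctuation that directly follows a URL (such as a full stop at the end of a sentence) becomes part of the match.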

See Detect URLs in text with JavaScript for more complete solutions to matching URLs.

Barmar