
I have a Regex that is able to detect URLs (Disclosure: I copied this Regex from the internet).

My goal is to split a string into an array of substrings, each of which is either a complete URL or the non-URL text around it.

For example:

const detectUrls = // some magical Regex
const input = 'Here is a URL: https://google.com <- That was the URL to Google.';

console.log(input.split(detectUrls)); // This should output ['Here is a URL: ', 'https://google.com', ' <- That was the URL to Google.']

My current Regex solution is as follows: /(([a-z]+:\/\/)?(([a-z0-9\-]+\.)+([a-z]{2}|aero|arpa|biz|com|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel|local|internal))(:[0-9]{1,5})?(\/[a-z0-9_\-.~]+)*(\/([a-z0-9_\-.]*)(\?[a-z0-9+_\-.%=&]*)?)?(#[a-zA-Z0-9!$&'()*+.=-_~:@/?]*)?)(\s+|$)/gi;

However, when I run the example code with my regex, I get a useless answer:

[ 'Here is a URL: ', 
  'https://google.com', 
  'https://', 
  'google.com', 
  'google.', 
  'com', 
  undefined, 
  undefined, 
  undefined, 
  undefined, 
  undefined, 
  undefined, 
  ' ', 
  '<- That was the URL to Google.',
]

Would anyone be able to point me in the right direction? Thanks in advance.

    In regex `(...)` is called a _capture group_. Your result array has one item for each capture group. A solution would be _named_ capture groups but browser support is probably bad (https://stackoverflow.com/questions/5367369/named-capturing-groups-in-javascript-regex). Instead of writing your own solution why not re-use an existing one? (https://www.google.com/search?q=js+url+extractor&oq=js+url+extractor&aqs=chrome..69i57.2193j0j4&sourceid=chrome&ie=UTF-8) – Sergiu Paraschiv Feb 26 '19 at 14:29

2 Answers


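You get the extra entries because String.prototype.split() adds the text captured by every capturing group (each pair of parentheses) to the result array; groups that did not take part in the match show up as undefined.
For the result you want, turn the inner groups into non-capturing groups, (?:myRegex), and keep a single capturing group around the whole URL so that the URL itself stays in the output.
I modified your regex so that it should work: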

/((?:[a-z]+:\/\/)?(?:(?:[a-z0-9\-]+\.)+(?:[a-z]{2}|aero|arpa|biz|com|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel|local|internal))(?::[0-9]{1,5})?(?:\/[a-z0-9_\-.~]+)*(?:\/(?:[a-z0-9_\-.]*)(?:\?[a-z0-9+_\-.%=&]*)?)?(?:#[a-zA-Z0-9!$&'()*+.=-_~:@/?]*)?)(?:\s+|$)/gi
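
To make the difference between capturing and non-capturing groups concrete, here is a minimal sketch. It uses a deliberately simplified URL pattern that I made up for illustration (the names withInnerGroups and outerOnly are hypothetical, and the pattern is much less thorough than the regex above), so treat it as a demonstration of how split() behaves rather than as a replacement regex.

```javascript
// Simplified, hypothetical URL pattern for illustration only; it is far less
// thorough than the regex in the question.
const input = 'Here is a URL: https://google.com <- That was the URL to Google.';

// Three capturing groups: the outer one plus two inner ones.
// split() appends the text of every capturing group to its result.
const withInnerGroups = /((https?:\/\/)([a-z0-9-]+\.)+[a-z]{2,})/i;

// Same pattern, but only the outer group captures; the inner groups use (?:...).
const outerOnly = /((?:https?:\/\/)(?:[a-z0-9-]+\.)+[a-z]{2,})/i;

const noisy = input.split(withInnerGroups);
const clean = input.split(outerOnly);

console.log(noisy);
// [ 'Here is a URL: ', 'https://google.com', 'https://', 'google.',
//   ' <- That was the URL to Google.' ]

console.log(clean);
// [ 'Here is a URL: ', 'https://google.com', ' <- That was the URL to Google.' ]
```

Note that this simplified pattern does not consume anything after the URL, so the space before `<-` stays in the last element. The regex above ends in (?:\s+|$), which swallows that whitespace, so its last element would start with `<-` instead.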

Tip: use an online website like https://regex101.com/ to test your regular expressions.
Also the answer for this question helped a bit:
Use of capture groups in String.split()

szt

Try this:

var detectUrls = /(([a-z]+:\/\/)?(([a-z0-9\-]+\.)+([a-z]{2}|aero|arpa|biz|com|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel|local|internal))(:[0-9]{1,5})?(\/[a-z0-9_\-.~]+)*(\/([a-z0-9_\-.]*)(\?[a-z0-9+_\-.%=&]*)?)?(#[a-zA-Z0-9!$&'()*+.=-_~:@/?]*)?)(\s+|$)/gi;

var input = "Here is a URL: https://google.com";

alert(input.match(detectUrls));

Working Fiddle: https://jsfiddle.net/as2pbe3m/
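
With the g flag, match() returns only the matched URLs, not the text between them. If the goal is still an array that alternates between plain text and URLs, one possible approach is to walk the matches and slice out the gaps. The sketch below assumes an environment that supports String.prototype.matchAll (ES2020); splitByUrls and simpleUrl are hypothetical names, and the pattern is a simplified stand-in rather than the regex above.

```javascript
// Simplified, hypothetical URL pattern for illustration only.
// matchAll() requires the g flag and throws a TypeError without it.
const simpleUrl = /https?:\/\/(?:[a-z0-9-]+\.)+[a-z]{2,}/gi;

// Returns the pieces of `text` in order: plain text, URL, plain text, ...
function splitByUrls(text, pattern) {
  const parts = [];
  let last = 0; // index just past the previous match
  for (const m of text.matchAll(pattern)) {
    // Keep the non-URL text between the previous match and this one.
    if (m.index > last) parts.push(text.slice(last, m.index));
    parts.push(m[0]); // the URL itself
    last = m.index + m[0].length;
  }
  // Keep whatever follows the final URL.
  if (last < text.length) parts.push(text.slice(last));
  return parts;
}

const pieces = splitByUrls(
  'Here is a URL: https://google.com <- That was the URL to Google.',
  simpleUrl
);
console.log(pieces);
// [ 'Here is a URL: ', 'https://google.com', ' <- That was the URL to Google.' ]
```

Because this uses only the full match (m[0]), the number of capturing groups in the pattern no longer affects the output.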

Rahul Sharma