1

I have a JS function which is passed a string that a RegEx is run against, and returns any matches:

searchText= // some string which may or may not contain URLs
Rxp= new RegExp("([a-zA-Z\d]+://)?(\w+:\w+@)?([a-zA-Z\d.-]+\.[A-Za-z]{2,4})(:\d+)?(/.*)?/ig")
return searchText.match(Rxp);

The RegExp should return matches for any of the following (and similar derivations):

However, no such luck. Any suggestions?

SW4
  • 69,876
  • 20
  • 132
  • 137
  • I know it's not an answer, but I suggest you to see this website : http://daringfireball.net/2010/07/improved_regex_for_matching_urls – Alexandre Khoury Jun 29 '12 at 10:43
  • If you want to specify a regex by things it should match, it'd help to say what patterns it shouldn't match. – perfectionist Jun 29 '12 at 10:51
  • I'd seen the Daring Fireball one, but it doesnt match e.g. 'google.com' – SW4 Jun 29 '12 at 10:54
  • @ErgoSummary Yes because it shouldn't. For exemple, if you write "hi all.how are you?" it will be converted to "hi http://all.how are you?" See a full explanation here : http://stackoverflow.com/a/10505843/1365010 (Note that stackoverflow doesn't match google.com too) – Alexandre Khoury Jun 29 '12 at 11:00
  • @user1365010, I know the pitfalls, however it needs to match in this incidence (various other cross checks take place), the regex is a loose first stage identifier of 'url like strings' – SW4 Jun 29 '12 at 11:01
  • 1
    @ErgoSummary if you really need to match it, see the regex in this answer http://stackoverflow.com/a/10505843/1365010 – Alexandre Khoury Jun 29 '12 at 11:04

2 Answers2

3

In a string, \ has to be escaped: \\.

First, the string is interpreted. \w turns in w, because it has no significant meaning.
Then, the parsed string is turned in a RegEx. But \ is lost during the string parsing, so your RegEx breaks.

Instead of using the RegExp constructor, use RegEx literals:

Rxp = /([a-zA-Z\d]+:\/\/)?(\w+:\w+@)?([a-zA-Z\d.-]+\.[A-Za-z]{2,4})(:\d+)?(\/.*)?/ig;
// Note: I recommend to use a different variable name. Variables starting with a
//  capital usually indicate a constructor, by convention.

If you're not 100% sure that the input is a string, it's better to use the exec method, which coerces the argument to a string:

return Rxp.exec(searchText);

Here's a pattern which includes the query string and URL fragment:

/([a-zA-Z\d]+:\/\/)?(\w+:\w+@)?([a-zA-Z\d.-]+\.[A-Za-z]{2,4})(:\d+)?(\/[^?#\s]*‌)?(\?[^#\s]*)?(#\S*)?/ig
Rob W
  • 341,306
  • 83
  • 791
  • 678
  • Also, the `/ig` at the end is not going to work as expected. Better use a regex literal instead. – Tim Pietzcker Jun 29 '12 at 10:34
  • +1 for the literal. However, now you have to escape the slashes inside the regex. – Tim Pietzcker Jun 29 '12 at 10:35
  • Many thanks, looks like a huge improvement- for matches that end with '?querystring=value' (i.e. a URL querystring) that part is omitted from the returned value? – SW4 Jun 29 '12 at 10:38
  • @ErgoSummary Do you want to include the query string? Use something along the lines of `(\?[^#\s]+)?` (assuming that your URLs end where a space starts). – Rob W Jun 29 '12 at 10:42
  • If present the query string should be returned, the URLs will end either with a space, or at the end of the string itself. How would I incorporate the above into the overall string? Apologies for my lack of RegExp know-how – SW4 Jun 29 '12 at 10:44
  • @Rob W, Thanks- however it returns an error (eg on http://www.regexplanet.com/advanced/javascript/index.html)? – SW4 Jun 29 '12 at 10:58
  • 1
    @ErgoSummary There is an invisible character in the pattern. `/([a-zA-Z\d]+:\/\/)?(\w+:\w+@)?([a-zA-Z\d.-]+\.[A-Za-z]{2,4})(:\d+)?(\/[^?#\s]*)?(\?[^#\s]*)?(#\S*)?/ig.exec('http://www.regexplanet.com/advanced/javascript/index.html')` works fine. EDIT: Again. I'll add the pattern to my answer instead. – Rob W Jun 29 '12 at 11:15
1

Firstly, there's no real need to create your pattern via the RegExp constructor since it doesn't contain anything dynamic. You can just use the literal /pattern/ instead.

If you do use the constructor, though, you have to remember your pattern is declared as a string, not a literal REGEXP, so you'll need to double-escape special characters, e.g. \\d, not \d. Also, there were several forward slashes you weren't escaping at all.

With the constructor, modifiers (g, i) are passed as a second argument, not appended to the pattern.

So to literally change what you have, it would be:

Rxp= new RegExp("([a-zA-Z\\d]+:\\/\\/)?(\\w+:\\w+@)?([a-zA-Z\\d.-]+\\.[A-Za-z]{2,4})(:\\d+)?(\\/.*)?", "ig")

But better would be:

Rxp = /([a-zA-Z\d]+:\/\/)?(\w+:\w+@)?([a-zA-Z\d.-]+\.[A-Za-z]{2,4})(:\d+)?(\/.*)?/gi;
Mitya
  • 33,629
  • 9
  • 60
  • 107