
I need a JavaScript regular expression that extracts "jones.com/ca" from "Hello we are jones.com/ca in Tampa". The "jones.com/ca" part could use any web URL extension (for example .net, .co, .gov) and any name. So the regular expression needs to find every occurrence of, say, ".com", together with all the text back to the preceding whitespace or beginning of line and forward to the next whitespace or end of line (minus any trailing punctuation).

Right now I am testing with the example line "jones.com/ca some text" and the JavaScript regular expression "\\(.+?^\\s).com?([^\\s]+)?\\", and all I get as output is ".com/ca".

Sebastian Paaske Tørholm
  • "any web url extension", if you mean a valid TLD, is a *very* long list, much longer than just .com, .net, .co, and .gov. It may be best to match something that merely looks like it might be a TLD. – Jim Blackler Apr 06 '11 at 13:12
  • You need to let me know how detailed you expect the regex to be and I can edit my post. I am actually the Joe that posted the last scratched out regex at this link http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx – Joe Apr 06 '11 at 13:20

3 Answers


This example captures the specific domains com, org, and gov, followed by a two-letter path:

\b\w+\.(?:com|org|gov)/[a-z]{2}\b

And this one captures almost any domain with a two- or three-letter extension:

\b\w+\.[a-z]{2,3}/[a-z]{2}\b

Both patterns use word boundaries (\b) so that the match does not include the surrounding whitespace.
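For anyone trying these from JavaScript, here is a minimal sketch of the two patterns as regex literals. The variable names and the "g" and "i" flags are my own additions, not part of the answer: "g" returns every match, and "i" is an assumption so that upper-case input also matches. Note that the "/" has to be escaped inside a literal.

```javascript
// The two patterns from the answer, written as JavaScript regex literals.
// Names and flags are illustrative assumptions, not part of the original.
const specificTlds = /\b\w+\.(?:com|org|gov)\/[a-z]{2}\b/gi;
const anyShortTld = /\b\w+\.[a-z]{2,3}\/[a-z]{2}\b/gi;

const text = "Hello we are jones.com/ca in Tampa";

console.log(text.match(specificTlds)); // [ 'jones.com/ca' ]
console.log(text.match(anyShortTld));  // [ 'jones.com/ca' ]

// Both patterns require a two-letter path, so a bare domain is not matched.
const bareDomain = "Visit jones.com today".match(anyShortTld);
console.log(bareDomain); // null
```

If the path should be optional or longer than two letters, the `\/[a-z]{2}` part would need to be loosened.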

Joe

Matching URLs is a bit of a dark art. The following site has a fairly well-designed regex for this purpose: http://daringfireball.net/2010/07/improved_regex_for_matching_urls

Marcelo Cantos

A comprehensive regex for this is going to be much more complicated than you think. The list of top-level domains is fairly long (.gov, .info, .edu, .museum, etc.), and there are "special" domains like localhost as well. Also, many domains end in a two-letter country abbreviation (google.com.br for Google Brazil, for example, or del.icio.us).

The easiest thing would be to look for http(s):// or www at the beginning and just assume what comes after is a domain name. If you don't, you're going to either miss a lot, or get a lot of false positives.

You could try the following, but the last option (after the last |) is going to be open to a significant number of false positives:

/https?:\/\/\S+|www\.\S+|([-a-z0-9_]+\.)+(com|org|edu|gov|mil|info|[a-z]{2})(\/\S*)?|([-a-z0-9_]+\.)+[-a-z0-9_]+\/\S*/ig
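As a rough sketch of how the pattern above might be used: the helper name `findUrls` and the trailing-punctuation cleanup are assumptions of mine, added because `\S+` and `\S*` will swallow a trailing comma or full stop, which the question wants removed.

```javascript
// The pattern from the answer, unchanged.
const urlLike = /https?:\/\/\S+|www\.\S+|([-a-z0-9_]+\.)+(com|org|edu|gov|mil|info|[a-z]{2})(\/\S*)?|([-a-z0-9_]+\.)+[-a-z0-9_]+\/\S*/ig;

// Hypothetical helper: collect every match, then strip trailing punctuation.
function findUrls(text) {
  const matches = text.match(urlLike) || [];
  return matches.map((m) => m.replace(/[.,;:!?)\]'"]+$/, ""));
}

console.log(findUrls("Hello we are jones.com/ca in Tampa"));
// [ 'jones.com/ca' ]

console.log(findUrls("See http://example.org/page, or www.example.net."));
// [ 'http://example.org/page', 'www.example.net' ]

// A false positive: the two-letter alternative matches part of a file name.
console.log(findUrls("Saved as notes.txt"));
// [ 'notes.tx' ]
```

The last call shows the kind of false positive to expect from the bare-domain alternatives, since anything of the form word-dot-two-letters qualifies.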
Justin Morgan - On strike