4

I need a JavaScript regular expression that scans a block of plain text and returns the text with the URLs turned into links.

This is what I have:

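findLinks: function(s) {
          // Match a whitespace character followed by an http:// or ftp:// URL.
          var hlink = /\s(ht|f)tp:\/\/([^ \,\;\:\!\)\(\"\'\\f\n\r\t\v])+/g;
          return s.replace(hlink, function($0) {
              // Drop the leading whitespace character matched by \s.
              var url = $0.substring(1);
              // Strip trailing periods, which are usually sentence punctuation.
              while (url.length > 0 && url.charAt(url.length - 1) == '.') url = url.substring(0, url.length - 1);

              // Wrap the URL in an anchor tag.
              return ' <a href="' + url + '">' + url + '</a>';
          });
      }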

The problem is that it only matches URLs with a protocol, such as http://www.google.com, and NOT bare ones such as google.com/adsense.

How could I accomplish both?

FrustratedWithFormsDesigner
Theofanis Pantelides

4 Answers

6

I use this as a reference all the time. The author lists 8 regexes you should know.

http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/

Here is what he uses to match URLs:

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/ 

He also breaks down what each part does. That is very useful for learning regexes, rather than just getting an answer that works for reasons you don't understand.
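Note that the pattern as quoted is anchored with ^ and $, so it checks whether a whole string is a URL rather than finding URLs inside a longer text. Here is a minimal sketch of the difference. The `scanning` variant is my own hypothetical adaptation, not something from the linked article: it drops the anchors, adds the global flag, removes the space from the path class, and flattens the nested quantifier.

```javascript
// Pattern as quoted above: anchored, so it validates a whole string.
var wholeString = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical scanning variant (an assumption, not from the article):
// no anchors, global flag, and no space in the path class, so a match
// stops at whitespace.
var scanning = /(https?:\/\/)?([\da-z.-]+)\.([a-z.]{2,6})([\/\w.-]*)/g;

var text = 'see google.com/adsense and http://www.google.com for details';

console.log(wholeString.test('google.com/adsense')); // true
console.log(wholeString.test(text));                 // false: the text is not a single URL
console.log(text.match(scanning));
// [ 'google.com/adsense', 'http://www.google.com' ]
```

The scanning variant is still loose: lowercase only, and it does nothing special about sentence punctuation, so treat it as a starting point.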

Mark S
MWill
  • 1
    His email regex is missing valid characters like the + sign in the part before the @ sign – CaffGeek Nov 18 '09 at 15:00
  • 1
    Email validation with regex is no trivial matter. I think this is more for learning than for using in hardcore production environments. However the URL pattern has worked well for me. Obviously it's going to need adjustments if your flavor of regex differs. – MWill Nov 18 '09 at 15:33
  • I love you! Although the link is not 100% the answer, it gave me a good alternative. – Theofanis Pantelides Nov 18 '09 at 17:04
  • The above link is dead, it is now available at: https://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149 – Harish ST Apr 13 '21 at 10:59
3

This is a non-trivial task. To match any URI that is valid according to the relevant RFCs you need a monumentally complex regular expression, and even that won't filter out URIs with invalid top-level domains (e.g. http://brussels.sprout/). So you have to compromise. Determine what's important to you: are false positives or false negatives more acceptable? Do you want to limit top-level domains to only those that currently exist? Do you allow non-Latin characters in matched URIs? Decide what you need your regular expression to do and design it accordingly, rather than blindly copying and pasting an example from the web.
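To make the trade-offs concrete, here is a rough sketch of one possible set of compromises, not a recommended pattern. The assumptions are all mine: a tiny illustrative TLD allowlist (`TLDS`), ASCII-only host names, an optional http/https/ftp scheme, and a hypothetical `linkify` helper that defaults to http:// when no scheme is given. It prefers false negatives over false positives.

```javascript
// Illustrative allowlist only (an assumption); a real list is far longer
// and changes over time.
var TLDS = ['com', 'org', 'net'];

// Optional scheme, dot-separated ASCII host labels, an allowed TLD,
// then an optional path of non-whitespace characters.
var urlPattern = new RegExp(
  '\\b((?:https?|ftp):\\/\\/)?((?:[a-z0-9-]+\\.)+(?:' + TLDS.join('|') + '))\\b(\\/[^\\s<>"\']*)?',
  'gi'
);

// Hypothetical helper: wraps each match in an anchor tag.
// HTML escaping of the URL is omitted to keep the sketch short.
function linkify(text) {
  return text.replace(urlPattern, function (match, scheme) {
    // Treat trailing sentence punctuation as text, not as part of the URL.
    var url = match.replace(/[.,;:!?)]+$/, '');
    var tail = match.slice(url.length);
    // Assume http:// when the text gave no scheme (a design choice).
    var href = scheme ? url : 'http://' + url;
    return '<a href="' + href + '">' + url + '</a>' + tail;
  });
}

console.log(linkify('Try google.com/adsense or http://www.google.com. Not brussels.sprout'));
```

With this allowlist, brussels.sprout is left alone, but so is any real URL under a TLD that is not on the list. That is exactly the kind of decision you have to make up front.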

Tim Down
2

You could make the protocol part optional:

/\s((ht|f)tp:\/\/)?([^ \,\;\:\!\)\(\"\'\\f\n\r\t\v])+/g
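A quick check of what this change does, as a sketch with a sample sentence of my own. Once the protocol is optional, nothing else in the pattern requires URL-like structure, so every word preceded by whitespace matches. One possible follow-up (my assumption, not part of the pattern above) is to keep only matches that contain a dot.

```javascript
// The pattern from above, with the protocol group made optional.
var optionalProtocol = /\s((ht|f)tp:\/\/)?([^ \,\;\:\!\)\(\"\'\\f\n\r\t\v])+/g;

var text = 'Visit google.com/adsense or http://www.google.com today';

var matches = text.match(optionalProtocol);
console.log(matches);
// [ ' google.com/adsense', ' or', ' http://www.google.com', ' today' ]

// Hypothetical extra filter: require at least one dot in the match.
var withDot = matches.filter(function (m) {
  return m.indexOf('.') !== -1;
});
console.log(withDot);
// [ ' google.com/adsense', ' http://www.google.com' ]

// Side note: as written, \\f in the character class is an escaped
// backslash followed by a literal "f", so matching stops at the letter f.
console.log(' http://www.facebook.com'.match(optionalProtocol));
// [ ' http://www.' ]
```

Each match still includes the leading whitespace character, which the original replace callback already strips with substring(1).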

FrustratedWithFormsDesigner
0

Try this (it works with your sample text):

\S+\.\S+
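For reference, a small sketch of how this pattern behaves on other input (the sample sentence is my own). It matches any run of non-whitespace characters with a dot somewhere inside it, so it catches bare URLs but also decimals, abbreviations, and trailing punctuation.

```javascript
// Any non-whitespace run containing a dot with at least one
// character on each side of it.
var loose = /\S+\.\S+/g;

var sample = 'Pi is 3.14, e.g. see google.com/adsense.';

var hits = sample.match(loose);
console.log(hits);
// [ '3.14,', 'e.g.', 'google.com/adsense.' ]
```

So if you use it, you will probably want some post-processing to trim punctuation and discard matches that are not URLs.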
Rubens Farias