I have a chat view where users can send URLs to one another. When a message contains a URL, I want the user to be able to tap the link and open it in a web view.

I'm using IFTweetLabel, which uses RegexKitLite. Currently, links are only supported if the URL starts with http/https. I want to support links without the http prefix, for example www.nytimes.com, and even without the "www", as in nytimes.com (and a bunch of other extensions).

This is the regular expression for the http/https prefix:

@"([hH][tT][tT][pP][sS]?:\\/\\/[^ ,'\">\\]\\)]*[^\\. ,'\">\\]\\)])"

Can someone tell me the other regular expressions I need to meet these requirements?

I tried using this one, but adding it to Objective-C code generates a lot of issues.

Thanks

Idan

3 Answers

The following is John Gruber's URL-matching regex:

(?i)\b(?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])

The following is a regex I came up with by blending a few other regexes I had around with a good chunk of Gruber's regex:

(?i)\b(?:(?:[a-z][\w\-]+://(?:\S+?(?::\S+?)?\@)?)|(?:(?:[a-z0-9\-]+\.)+[a-z]{2,4}))(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]*\)))*\))*(?<![\s`!()\[\]{};:'".,<>?«»“”‘’])

The following is a sample program that demonstrates, via RegexKitLite, what each regex matches against the sample text of:

Did you see http://www.stackoverflow.com? Or http://www.stackoverflow.com/?

And then there is www.stackoverflow.com/, along with www.stackoverflow.com/index.

Maybe something like stackoverflow.com with extra stackoverflow.com? Or "stackoverflow.com"?

Perhaps jobs.stackoverflow.com, or 'http://twitter.com/#!/CHOCKENBERRY', the CHOCKLOCK!!

File @file:///Users/johne/rkl/rkl.html#RegexKitLiteCookbook?

Maybe http://www.yahoo.com/index///i.html! http://www.yahoo.com/////xyz.html?!

The code:

#import <Foundation/Foundation.h>
#import "RegexKitLite.h"

int main(int argc, char *argv[]) {
  NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

  NSString *urlRegex = @"(?i)\\b(?:(?:[a-z][\\w\\-]+://(?:\\S+?(?::\\S+?)?\\@)?)|(?:(?:[a-z0-9\\-]+\\.)+[a-z]{2,4}))(?:[^\\s()<>]+|\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]*\\)))*\\))*(?<![\\s`!()\\[\\]{};:'\".,<>?«»“”‘’])";

  // John Gruber's URL matching regex from http://daringfireball.net/2010/07/improved_regex_for_matching_urls
  NSString *gruberURLRegex = @"(?i)\\b(?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:'\".,<>?«»“”‘’])";

  NSString *urlString = @"Did you see http://www.stackoverflow.com?  Or http://www.stackoverflow.com/?\n\nAnd then there is www.stackoverflow.com/, along with www.stackoverflow.com/index.\n\nMaybe something like stackoverflow.com with extra stackoverflow.com?  Or \"stackoverflow.com\"?\n\nPerhaps jobs.stackoverflow.com, or 'http://twitter.com/#!/CHOCKENBERRY', the CHOCKLOCK!!\n\nFile @file:///Users/johne/rkl/rkl.html#RegexKitLiteCookbook?\n\nMaybe http://www.yahoo.com/index///i.html!  http://www.yahoo.com/////xyz.html?!";

  NSLog(@"String :\n\n%@\n\n", urlString);

  NSLog(@"Matches: %@\n", [urlString componentsMatchedByRegex:urlRegex]);

  NSLog(@"Gruber URL Regex Matches: %@\n", [urlString componentsMatchedByRegex:gruberURLRegex]);

  [pool release]; pool = NULL;
  return(0);
}

Compile with:

shell% gcc -o url url.m RegexKitLite.m -framework Foundation -licucore

When run:

shell% ./url
2011-05-27 20:32:58.204 url[25520:903] String :

Did you see http://www.stackoverflow.com?  Or http://www.stackoverflow.com/?

And then there is www.stackoverflow.com/, along with www.stackoverflow.com/index.

Maybe something like stackoverflow.com with extra stackoverflow.com?  Or "stackoverflow.com"?

Perhaps jobs.stackoverflow.com, or 'http://twitter.com/#!/CHOCKENBERRY', the CHOCKLOCK!!

File @file:///Users/johne/rkl/rkl.html#RegexKitLiteCookbook?

Maybe http://www.yahoo.com/index///i.html!  http://www.yahoo.com/////xyz.html?!

2011-05-27 20:32:58.211 url[25520:903] Matches: (
    "http://www.stackoverflow.com",
    "http://www.stackoverflow.com/",
    "www.stackoverflow.com/",
    "www.stackoverflow.com/index",
    "stackoverflow.com",
    "stackoverflow.com",
    "stackoverflow.com",
    "jobs.stackoverflow.com",
    "http://twitter.com/#!/CHOCKENBERRY",
    "file:///Users/johne/rkl/rkl.html#RegexKitLiteCookbook",
    "http://www.yahoo.com/index///i.html",
    "http://www.yahoo.com/////xyz.html"
)
2011-05-27 20:32:58.213 url[25520:903] Gruber URL Regex Matches: (
    "http://www.stackoverflow.com",
    "http://www.stackoverflow.com/",
    "www.stackoverflow.com/",
    "www.stackoverflow.com/index",
    "http://twitter.com/#!/CHOCKENBERRY",
    "file:///Users/johne/rkl/rkl.html#RegexKitLiteCookbook",
    "http://www.yahoo.com/index///i.html",
    "http://www.yahoo.com/////xyz.html"
)

EDIT 2011/05/27: Made a minor change to the regex to fix a problem where it wasn't matching ( ) parentheses correctly.

EDIT 2011/05/27: Found some additional corner cases that the regex above didn't handle well. Updated regex:

(?i)\b(?:[a-z][\w\-]+://(?:\S+?(?::\S+?)?\@)?)?(?:(?:(?<!:/|\.)(?:(?:[a-z0-9\-]+\.)+[a-z]{2,4}(?![a-z]))|(?<=://)/))(?:(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]*\)))*\))*)(?<![\s`!()\[\]{};:'".,<>?«»“”‘’])

... as an Obj-C string:

@"(?i)\\b(?:[a-z][\\w\\-]+://(?:\\S+?(?::\\S+?)?\\@)?)?(?:(?:(?<!:/|\\.)(?:(?:[a-z0-9\\-]+\\.)+[a-z]{2,4}(?![a-z]))|(?<=://)/))(?:(?:[^\\s()<>]+|\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]*\\)))*\\))*)(?<![\\s`!()\\[\\]{};:'\".,<>?«»“”‘’])";

The OP also asked how to make sure the trailing TLD is "valid". Here's the same regex, in Obj-C string form, with all the currently valid TLDs (as of 2011/05/27):

@"(?i)\\b(?:[a-z][\\w\\-]+://(?:\\S+?(?::\\S+?)?\\@)?)?(?:(?:(?<!:/|\\.)(?:(?:[a-z0-9\\-]+\\.)+(?:(ac|ad|ae|aero|af|ag|ai|al|am|an|ao|aq|ar|arpa|as|asia|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|biz|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cat|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|com|coop|cr|cu|cv|cx|cy|cz|de|dj|dk|dm|do|dz|ec|edu|ee|eg|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gov|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|info|int|io|iq|ir|is|it|je|jm|jo|jobs|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mil|mk|ml|mm|mn|mo|mobi|mp|mq|mr|ms|mt|mu|museum|mv|mw|mx|my|mz|na|name|nc|ne|net|nf|ng|ni|nl|no|np|nr|nu|nz|om|org|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|pro|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|st|su|sv|sy|sz|tc|td|tel|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|travel|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|xn--0zwm56d|xn--11b5bs3a9aj6g|xn--3e0b707e|xn--45brj9c|xn--80akhbyknj4f|xn--90a3ac|xn--9t4b11yi5a|xn--clchc0ea0b2g2a9gcd|xn--deba0ad|xn--fiqs8s|xn--fiqz9s|xn--fpcrj9c3d|xn--fzc2c9e2c|xn--g6w251d|xn--gecrj9c|xn--h2brj9c|xn--hgbk6aj7f53bba|xn--hlcj6aya9esc7a|xn--j6w193g|xn--jxalpdlp|xn--kgbechtv|xn--kprw13d|xn--kpry57d|xn--lgbbat1ad8j|xn--mgbaam7a8h|xn--mgbayh7gpa|xn--mgbbh1a71e|xn--mgbc0a9azcg|xn--mgberp4a5d4ar|xn--o3cw4h|xn--ogbpf8fl|xn--p1ai|xn--pgbs0dh|xn--s9brj9c|xn--wgbh1c|xn--wgbl6a|xn--xkc2al3hye2a|xn--xkc2dl3a5ee0h|xn--yfro4i67o|xn--ygbi2ammx|xn--zckzah|xxx|ye|yt|za|zm|zw))(?![a-z]))|(?<=://)/))(?:(?:[^\\s()<>]+|\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]*\\)))*\\))*)(?<![\\s`!()\\[\\]{};:'\".,<>?«»“”‘’])";
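
Since the TLD list changes over time, it may be easier to build the alternation from an array than to hand-edit one long string. The following is only a minimal sketch, not the regex above: it uses Foundation's NSRegularExpression instead of RegexKitLite, it reproduces just the host-name portion of the pattern (scheme, path, and trailing-punctuation handling are left out), and the helper name HostRegexForTLDs and the four-entry TLD list are made up for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the answer's regex): builds a
// case-insensitive host-name pattern whose last label must be one of `tlds`.
static NSRegularExpression *HostRegexForTLDs(NSArray *tlds) {
  NSMutableArray *escaped = [NSMutableArray arrayWithCapacity:[tlds count]];
  for (NSString *tld in tlds) {
    // Escape each entry so that it is always treated as literal text.
    [escaped addObject:[NSRegularExpression escapedPatternForString:tld]];
  }

  // One or more dot-terminated labels, then an allowed TLD that is not
  // followed by another letter (so "com" does not match inside "community").
  NSString *pattern = [NSString stringWithFormat:@"\\b(?:[a-z0-9\\-]+\\.)+(?:%@)(?![a-z])",
                                                 [escaped componentsJoinedByString:@"|"]];

  return [NSRegularExpression regularExpressionWithPattern:pattern
                                                   options:NSRegularExpressionCaseInsensitive
                                                     error:NULL];
}

int main(int argc, char *argv[]) {
  @autoreleasepool {
    // Assumed, deliberately short TLD list; substitute the full list as needed.
    NSArray *tlds = [NSArray arrayWithObjects:@"com", @"net", @"org", @"edu", nil];
    NSRegularExpression *regex = HostRegexForTLDs(tlds);

    NSString *text = @"nytimes.com and www.example and jobs.stackoverflow.com";
    NSArray *matches = [regex matchesInString:text options:0 range:NSMakeRange(0, [text length])];
    for (NSTextCheckingResult *match in matches) {
      NSLog(@"%@", [text substringWithRange:[match range]]);
    }
  }
  return 0;
}
```

With a list like this, "www.example" is rejected because "example" is not an allowed TLD, while the "www" stays optional.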
johne

This will match both http://example.org and www.example.org.

@"(([hH][tT][tT][pP][sS]?:\\/\\/|www\\.)[^ ,'\">\\]\\)]*\\.[^\\. ,'\">\\]\\)]{2,6})"

Note that I added a "match group" (capture group), so check the match/search result returned by the regex to make sure the right parameters are re-inserted in the right place.

If you could post the entire code snippet, it would be easier.

RegExp explanation:

(
    (
        [hH][tT][tT][pP][sS]?:\/\/    # Match "http://" or "https://" in any letter case
        |                             # OR
        www\.                         # Match "www" followed by a literal dot
    )
    [^ ,'\">\]\)]*                    # Match zero or more characters that are none of: space, comma, apostrophe, quotation mark, greater-than sign, right square bracket, right parenthesis
    \.                                # Match a literal dot
    [^\. ,'\">\]\)]{2,6}              # Match 2 to 6 characters that are none of: dot, space, comma, apostrophe, quotation mark, greater-than sign, right square bracket, right parenthesis
)
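
To make the capture-group point concrete, here is a rough sketch of how the two groups could be read back. It assumes Foundation's NSRegularExpression rather than whatever IFTweetLabel does internally, and the helper name LoadableLinks is hypothetical. Group 1 is the whole link and group 2 is the prefix (the scheme or "www."). Prepending "http://" to "www." links is an assumption about what the web view needs, not something the regex does.

```objc
#import <Foundation/Foundation.h>

// The pattern from this answer, as an Obj-C string.
static NSString * const kLinkPattern =
    @"(([hH][tT][tT][pP][sS]?:\\/\\/|www\\.)[^ ,'\">\\]\\)]*\\.[^\\. ,'\">\\]\\)]{2,6})";

// Hypothetical helper: returns every link found in `text` as a string that
// starts with a scheme, so it can be turned into an NSURL.
static NSArray *LoadableLinks(NSString *text) {
  NSError *error = nil;
  NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:kLinkPattern
                                                                         options:0
                                                                           error:&error];
  if (regex == nil) {
    NSLog(@"Invalid pattern: %@", error);
    return [NSArray array];
  }

  NSMutableArray *links = [NSMutableArray array];
  NSArray *matches = [regex matchesInString:text options:0 range:NSMakeRange(0, [text length])];
  for (NSTextCheckingResult *match in matches) {
    // Group 1: the whole link. Group 2: "http://", "https://", or "www.".
    NSString *link = [text substringWithRange:[match rangeAtIndex:1]];
    NSString *prefix = [text substringWithRange:[match rangeAtIndex:2]];
    if ([prefix isEqualToString:@"www."]) {
      link = [@"http://" stringByAppendingString:link];  // assumed default scheme
    }
    [links addObject:link];
  }
  return links;
}

int main(int argc, char *argv[]) {
  @autoreleasepool {
    NSLog(@"%@", LoadableLinks(@"Try www.example.org or http://example.org today"));
  }
  return 0;
}
```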
joar
  • Awesome. Now what about example.org without the "www" as well? – Idan May 26 '11 at 12:00
  • Another thing: with the regex you provided, a string like "www.example" still matches even though it's not really valid. I don't want that either. Can you add both of these requirements to the regex as well? Thanks! – Idan May 26 '11 at 12:23
  • That would be really complex. If you're not short of time I'd suggest you go to http://www.regular-expressions.info/tutorial.html. That way you can make the RegExp fully fit your needs. – joar May 26 '11 at 12:40
  • Oh, really? Can't I somehow force it to end with .com/.net/.org/.edu and so on? I don't mind adding all the options manually. If I can do that along with making the "www" optional, that would be exactly what I need. Thanks – Idan May 26 '11 at 12:49
  • Looks much better... Can you explain what you did? I'm curious about the {2,6} part at the end. I'm also curious whether you can provide another regex that would force the suffix to be .com, .net, etc. (I would add all of those manually, of course), one that doesn't have to be part of the one you already gave me. Anyway, thanks a lot, your answer is definitely accepted! – Idan May 26 '11 at 13:27
  • I've added an explanation for the regular expression. Cheers, – joar May 26 '11 at 14:13

You don't want to use a regular expression for this.

You want an NSDataDetector, and it'll find them all for you.
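
As a minimal sketch of what that could look like (the helper name DetectedURLs and the sample text are made up for illustration, and exactly which strings get detected is up to NSDataDetector, so treat the results as something to verify rather than a guarantee):

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the NSURLs that NSDataDetector finds in `text`,
// in order of appearance.
static NSArray *DetectedURLs(NSString *text) {
  NSError *error = nil;
  NSDataDetector *detector = [NSDataDetector dataDetectorWithTypes:NSTextCheckingTypeLink
                                                             error:&error];
  if (detector == nil) {
    NSLog(@"Could not create detector: %@", error);
    return [NSArray array];
  }

  NSMutableArray *urls = [NSMutableArray array];
  NSArray *matches = [detector matchesInString:text
                                       options:0
                                         range:NSMakeRange(0, [text length])];
  for (NSTextCheckingResult *match in matches) {
    // [match range] is the span of `text` to make tappable;
    // [match URL] is the URL to hand to the web view.
    if ([match resultType] == NSTextCheckingTypeLink && [match URL] != nil) {
      [urls addObject:[match URL]];
    }
  }
  return urls;
}

int main(int argc, char *argv[]) {
  @autoreleasepool {
    NSString *text = @"Did you see http://www.stackoverflow.com? Maybe www.nytimes.com, or nytimes.com.";
    for (NSURL *url in DetectedURLs(text)) {
      NSLog(@"%@", url);
    }
  }
  return 0;
}
```

Because each result carries both a range and a URL, the label can highlight the original text while the web view loads the URL object.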

Dave DeLong