Matching complex URLs within text blocks (R)

Question

I want to use John Gruber's regex (http://daringfireball.net/2010/07/improved_regex_for_matching_urls) to match complex URLs in text blocks. The regex is quite complex (as is the task; see "regex to find url in a text").

My problem is that I can't get it to work in R:

x <-   c("http://foo.com/blah_blah",
        "http://foo.com/blah_blah/",
        "(Something like http://foo.com/blah_blah)",
        "http://foo.com/blah_blah_(wikipedia)",
         "http://foo.com/more_(than)_one_(parens)",
         "(Something like http://foo.com/blah_blah_(wikipedia))",
         "http://foo.com/blah_(wikipedia)#cite-1",
         "http://foo.com/blah_(wikipedia)_blah#cite-1",
         "http://foo.com/unicode_(✪)_in_parens",
         "http://foo.com/(something)?after=parens",
         "http://foo.com/blah_blah.",
         "http://foo.com/blah_blah/.",
         "<http://foo.com/blah_blah>",
         "<http://foo.com/blah_blah/>",
         "http://foo.com/blah_blah,",
         "http://www.extinguishedscholar.com/wpglob/?p=364.",
         "http://✪df.ws/1234",
         "rdar://1234",
         "rdar:/1234",
         "x-yojimbo-item://6303E4C1-6A6E-45A6-AB9D-3A908F59AE0E",
         "message://%3c330e7f840905021726r6a4ba78dkf1fd71420c1bf6ff@mail.gmail.com%3e",
         "http://➡.ws/䨹",
         "www.c.ws/䨹",
         "<tag>http://example.com</tag>",
         "Just a www.example.com link.",
         "http://example.com/something?with,commas,in,url, but not at end",
         "What about <mailto:gruber@daringfireball.net?subject=TEST> (including brokets).",
         "mailto:name@example.com",
         "bit.ly/foo",
         "“is.gd/foo/”",
         "WWW.EXAMPLE.COM",
         "http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55))/Web_ENG/View_DetailPhoto.aspx?PicId=752",
         "http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55))",
         "http://lcweb2.loc.gov/cgi-bin/query/h?pp/horyd:@field(NUMBER+@band(thc+5a46634))")


t <- regexec("\\b((?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:'".,<>?«»“”‘’]))", x)             

regmatches(x,t)

I appreciate your help.

At least there is no such option as `perl=TRUE` in the documentation (regexec is the only command that doesn't have it). Even if I use `regexpr` and set `perl=TRUE`, it doesn't work. As far as I can tell, the latter part of the regex (`|[^\\s`!()\\[\\]{};:'".,<>?«»“”‘’]))`) seems to cause the problem. — majom, Apr 17 '13 at 13:21
Well, you can't use regmatches then, and I'm pretty sure `(?i)` is a perlism. — hadley, Apr 17 '13 at 13:22
You are right about `(?i)`; it is indeed not needed in R. I changed it in my question. However, the command runs fine if I delete the part mentioned above (but then it no longer recognizes all of the example URLs listed above). — majom, Apr 17 '13 at 13:45
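
A note on the fragment quoted in the comments above: it contains a bare `"` inside a double-quoted R string, which ends the string literal early, and the accepted answer writes that character as `\\\"` instead. The following is a minimal sketch of the escaping rule, using a made-up character class (`quote_class` and `txt` are illustrative names, not part of Gruber's pattern). It may not be the only reason the original call failed.

```r
# Hypothetical character class: "anything except a single or double quote".
# Inside a double-quoted R string the double quote must be written as \".
quote_class <- "[^'\"]+"

# The regex itself is 6 characters long: [ ^ ' " ] +
pattern_length <- nchar(quote_class)

# First run of characters that contains neither quote character.
txt <- "say \"hi\" now"
first_run <- regmatches(txt, regexpr(quote_class, txt, perl = TRUE))
```

In R 4.0.0 and later, a raw string literal of the form `r"(...)"` should avoid this kind of escaping altogether.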

majom · Accepted Answer · 2013-04-22T13:47:57.757

I ended up using `gregexpr`, as it supports `perl=TRUE`. After adapting the regex for R, I came up with the following solution (using the data above).

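# Return all URL matches for each element of x (Gruber's pattern, adapted for R).
findURL <- function(x) {
  # (?x) ignores the whitespace and line breaks inside the pattern; (?i) ignores case.
  # The result is named `m` so that it does not shadow base::t().
  m <- gregexpr("(?xi)\\b(
             (?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)
             (?:[^\\s\\(\\)<>]+|\\(([^\\s\\(\\)<>]+|(\\([^\\s\\(\\)<>]+\\)))*\\))+
             (?:\\(([^\\s\\(\\)<>]+|(\\([^\\s\\(\\)<>]+\\)))*\\)|[^\\s`!\\(\\)\\[\\]{};:'\\\"\\.,<>\\?«»“”‘’])
             )", x, perl = TRUE, fixed = FALSE)
  regmatches(x, m)
}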

# Find URLs
urls <- findURL(x) 

# Count URLs
count.urls <- sum(vapply(urls, length, integer(1)))

I hope this is helpful for others.
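
To see why `gregexpr` fits here, the sketch below uses a deliberately simplified pattern on two made-up strings. `simple_url` is an assumption for illustration: it is not Gruber's regex and only handles `http`/`https`. `gregexpr` reports every match in each element, whereas `regexpr` reports only the first.

```r
# Simplified stand-in pattern (an assumption for illustration only).
simple_url <- "(?i)\\bhttps?://[^\\s()<>]+"

txt <- c("See http://foo.com/a and HTTPS://bar.org/b now", "no links here")

# gregexpr: all matches per element; perl = TRUE selects the PCRE engine.
all_matches <- regmatches(txt, gregexpr(simple_url, txt, perl = TRUE))

# regexpr: at most one (the first) match per element.
first_matches <- regmatches(txt, regexpr(simple_url, txt, perl = TRUE))

# Total number of URLs found across all elements.
n_urls <- sum(vapply(all_matches, length, integer(1)))
```

Because `regmatches` returns a list with one character vector per input element when given `gregexpr` output, elements without a URL yield an empty vector and contribute zero to the count.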
