13

I have a list of urls that I would like to parse and normalize.

I'd like to be able to split each address into parts so that I can identify "www.google.com/test/index.asp" and "google.com/somethingelse" as being from the same website.

Rob Donnelly
  • What's your intended output? – Thomas Jun 24 '13 at 21:37
  • The ideal case would be something that splits things like [tldextract](https://pypi.python.org/pypi/tldextract), but if that is not readily available, I would like to get back the string up until the end of the top level domain (e.g. .com or .edu). Preferably it would also strip away http:// and www. and other prefixes like that. – Rob Donnelly Jun 24 '13 at 22:18
  • Could you give us an example output of what you'd expect with your example rather than giving a url for us to parse through and read. – Tyler Rinker Jun 24 '13 at 23:47
  • There now appears to be a [tldextract package](https://github.com/jayjacobs/tldextract) available for R. Here is [a blogpost describing](http://www.r-bloggers.com/parsing-domain-names-in-r-with-tldextract/) it. – tophcito Feb 10 '15 at 16:20

6 Answers

12

Since parse_url() uses regular expressions anyway, we may as well reinvent the wheel and create a single regular expression replacement in order to build a sweet and fancy gsub call.

Let's see. A URL consists of a protocol, a "netloc" (which may include username, password, hostname and port components), and a remainder which we happily strip away. Let's first assume there is no username, password, or port.

  • ^(?:(?:[[:alpha:]+.-]+)://)? matches the protocol header (copied from parse_url()); we strip it away if we find it
  • Also, a potential www. prefix is stripped away, but not captured: (?:www\\.)?
  • Anything up to the subsequent slash will be our fully qualified host name, which we capture: ([^/]+)
  • The rest we ignore: .*$

Now we plug together the regexes above, and the extraction of the hostname becomes:

PROTOCOL_REGEX <- "^(?:(?:[[:alpha:]+.-]+)://)?"
PREFIX_REGEX <- "(?:www\\.)?"
HOSTNAME_REGEX <- "([^/]+)"
REST_REGEX <- ".*$"
URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX, HOSTNAME_REGEX, REST_REGEX)
domain.name <- function(urls) gsub(URL_REGEX, "\\1", urls)

Change the host name regex to include (but not capture) the port:

HOSTNAME_REGEX <- "([^:/]+)(?::[0-9]+)?"
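
After rebuilding URL_REGEX with the same paste0() call as above, a URL that carries a port is then reduced to its bare host name as well (a quick check; the example URL is mine, not from the original answer):

URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX, HOSTNAME_REGEX, REST_REGEX)

> domain.name("http://example.com:8080/test")
[1] "example.com"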

And so forth and so on, until we finally arrive at an RFC-compliant regular expression for parsing URLs. However, for home use, the above should suffice:

> domain.name(c("test.server.com/test", "www.google.com/test/index.asp",
                "http://test.com/?ex"))
[1] "test.server.com" "google.com"      "test.com"       
krlmlr
  • The advantage of using code from a package is that it comes with unit tests, and you can file bug reports and someone else might fix the bug. – hadley Jun 25 '13 at 08:50
  • @hadley: Thanks for commenting on that. I haven't found unit tests for `parse_url`, though. If they were available, `parse_url` could be rewritten so that a single regular expression is used to capture all parts of an URL. -- Is it by design that the protocol prefix is mandatory for `parse_url`? – krlmlr Jun 25 '13 at 09:01
  • Yeah, I should have said the advantage of a package is that it _could_ come with unit tests. Patches welcome ;) I'd argue that the current answer is correct when the scheme is omitted - if you used that url in a web page, it would not take you to google.com. – hadley Jun 25 '13 at 09:05
  • I also don't see what the advantage of a single extremely complicated regexp is. – hadley Jun 25 '13 at 09:05
  • @hadley: The regexp could be decomposed the way I did in my answer, to make it more readable in the source code. Performance and vectorization would be an immediate advantage: If all components are captured, replacing in `gsub` by `"\\1\n\\2\n\\3\n..."` and splitting all strings afterwards should be faster than chewing the URL bit by bit. – krlmlr Jun 25 '13 at 09:09
11

You can use the parse_url() function from the R package httr:

library(httr)
parse_url("http://google.com/")

You can get more details here: http://cran.r-project.org/web/packages/httr/httr.pdf
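
Since parse_url() works on one URL at a time and (as noted in the comments below) wants the protocol prefix, the hostname for a set of URLs can be pulled out along these lines (a small sketch, not part of the original answer; the sub() call that drops a leading www. is my addition):

library(httr)
urls <- c("http://www.google.com/test/index.asp", "http://google.com/somethingelse")
hosts <- vapply(urls, function(u) parse_url(u)$hostname, character(1))
sub("^www\\.", "", hosts)   # both reduce to "google.com"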

Abdocia
  • Could you please provide example output for one of the URLs the OP has provided? – krlmlr Jun 24 '13 at 21:47
  • This is getting a ton of upvotes so I must be missing something. How can this help determine what urls belong together? – Tyler Rinker Jun 24 '13 at 22:08
  • `parse_url("www.google.com/test/index.asp")$path` gives a result of `"www.google.com/test/index.asp"` which is not very helpful. – Rob Donnelly Jun 24 '13 at 22:09
  • @TylerRinker: I, too, have assumed that this function *must* split the URL into its building blocks, without really reading the docs. I have added the missing parts in [my answer](http://stackoverflow.com/a/17286485/946850). – krlmlr Jun 24 '13 at 23:01
  • @TylerRinker: Actually, `parse_url` *does* split the URL into its building blocks. It's just that the protocol prefix is mandatory, and the split will be incorrect if the protocol prefix is missing. Updated my answer. – krlmlr Jun 25 '13 at 01:15
6

There's also the urltools package now, which is infinitely faster:

urltools::url_parse(c("www.google.com/test/index.asp", 
                      "google.com/somethingelse"))

##   scheme         domain port           path parameter fragment
## 1        www.google.com      test/index.asp                   
## 2            google.com       somethingelse                   
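
To get the OP's "same website" grouping from that output, the domain column can be normalized by dropping a leading www. (a small sketch on top of the answer's output, not part of the original answer):

urls <- c("www.google.com/test/index.asp", "google.com/somethingelse")
parsed <- urltools::url_parse(urls)
site <- sub("^www\\.", "", parsed$domain)   # "google.com" "google.com"
split(urls, site)                           # both URLs fall into the same group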
hrbrmstr
  • This is really much better than httr::parse_url, not only for the blazing speed but also for the vectorisation (no need to use *apply) – haddr Feb 05 '16 at 02:18
4

I'd forgo a package and use regex for this.

EDIT: reformulated after the robot attack from Dason...

x <- c("talkstats.com", "www.google.com/test/index.asp", 
    "google.com/somethingelse", "www.stackoverflow.com",
    "http://www.bing.com/search?q=google.com&go=&qs=n&form=QBLH&pq=google.com&sc=8-1??0&sp=-1&sk=")

parser <- function(x) gsub("www\\.", "", sapply(strsplit(gsub("http://", "", x), "/"), "[[", 1))
parser(x)

lst <- lapply(unique(parser(x)), function(var) x[parser(x) %in% var])
names(lst) <- unique(parser(x))
lst

## $talkstats.com
## [1] "talkstats.com"
## 
## $google.com
## [1] "www.google.com/test/index.asp" "google.com/somethingelse"     
## 
## $stackoverflow.com
## [1] "www.stackoverflow.com"
## 
## $bing.com
## [1] "http://www.bing.com/search?q=google.com&go=&qs=n&form=QBLH&pq=google.com&sc=8-1??0&sp=-1&sk="

This may need to be extended depending on the structure of the data.
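
For instance, a variant that also strips an https:// prefix (see Dason's comment below) could look like this; it is a sketch of my own, not part of the original answer:

parser2 <- function(x) {
    x <- gsub("^https?://", "", x)           # drop http:// or https://
    x <- sapply(strsplit(x, "/"), "[[", 1)   # keep everything before the first slash
    gsub("^www\\.", "", x)                   # drop a leading www.
}
parser2("https://www.google.com/test/index.asp")

## [1] "google.com"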

Tyler Rinker
  • Somebody actually has built an [RFC-compliant regular expression](http://stackoverflow.com/a/190405/946850) for URLs. This is not for the faint of heart, and a dedicated URL parser should be preferred here... – krlmlr Jun 24 '13 at 21:46
  • `x <- "http://www.bing.com/search?q=google.com&go=&qs=n&form=QBLH&pq=google.com&sc=8-10&sp=-1&sk="` You just identified bing as google. – Dason Jun 24 '13 at 21:50
  • @Dason who searches for google with bing? :) I'll update but I think the OP doesn't want to supply a pattern explicitly. – Tyler Rinker Jun 24 '13 at 21:51
  • People trying to destroy raptor's hopes and dreams. – Dason Jun 24 '13 at 21:52
  • @TylerRinker: R can handle [PCREs](http://www.pcre.org/) for sure (set `perl=T`), and you can extract the relevant parts by substituting some of the `(?:` in the supplied regular expression by `(` and, say, `gsub`bing with `"\\1"` to get the value of the first placeholder. – krlmlr Jun 24 '13 at 21:53
  • `x <- http://www.talkstats.com/showthread.php/45040-www.google.com?p=130154#post130154` No bing search there :p – Dason Jun 24 '13 at 21:54
  • Let's be honest, "google.com" is one of the most common searches on bing. – Rob Donnelly Jun 24 '13 at 22:05
  • @krlmlr that sounds cool but is way above my pay grade (i.e., my ability level). Would you mind showing as an answer? – Tyler Rinker Jun 24 '13 at 22:06
  • Your solution doesn't handle https properly. – Dason Jun 24 '13 at 22:17
  • @TylerRinker: I still consider using custom regular expressions to parse URLs less than optimal, especially (as just seen) if there is a library available to do just that. That's why I'm not posting an answer, but let me elaborate on what I wrote earlier. The regular expression I have pointed to is one that R can interpret by supplying `perl=T` to `grep`, `gsub` and friends. Note the strangely looking `(?:` construct -- this is a [non-capturing subpattern](http://php.net/manual/en/regexp.reference.subpatterns.php). Also remember that plain parentheses allow *capturing* the matched text... – krlmlr Jun 24 '13 at 22:18
  • @TylerRinker: ...and reusing it with the `\#` syntax where `#` is a digit. Reusing works in both the pattern and the replacement text (in `gsub`). Just try: `gsub("^.* have ([0-9]+) .*$", "\\1", "I have 17 apples")` – krlmlr Jun 24 '13 at 22:21
3

Building upon R_Newbie's answer, here's a function that will extract the server name from a (vector of) URLs, stripping away a www. prefix if it exists, and gracefully ignoring a missing protocol prefix.

domain.name <- function(urls) {
    require(httr)
    require(plyr)
    paths <- laply(urls, function(u) with(parse_url(u),
                                          paste0(hostname, "/", path)))
    gsub("^/?(?:www\\.)?([^/]+).*$", "\\1", paths)
}

The parse_url function is used to extract the hostname and path components, which are then processed further by gsub. The /? and (?:www\\.)? parts of the regular expression match an optional leading slash followed by an optional www., and the [^/]+ matches everything after that but before the first slash -- this is captured and used in the replacement text of the gsub call.

> domain.name(c("test.server.com/test", "www.google.com/test/index.asp",
                "http://test.com/?ex"))
[1] "test.server.com" "google.com"      "test.com"       
krlmlr
  • There's a ton of extra overhead in this in that `parse_url` generates a larger list of information than what is needed and the output still needs to be regexed. I really think that the perl approach you alluded to may be useful to see. – Tyler Rinker Jun 24 '13 at 23:46
2

If you like tldextract, one option would be to use the version hosted on App Engine:

require(RJSONIO)
test <- c("test.server.com/test", "www.google.com/test/index.asp", "http://test.com/?ex")
lapply(paste0("http://tldextract.appspot.com/api/extract?url=", test), fromJSON)
[[1]]
   domain subdomain       tld 
 "server"    "test"     "com" 

[[2]]
   domain subdomain       tld 
 "google"     "www"     "com" 

[[3]]
   domain subdomain       tld 
   "test"        ""     "com" 
user1609452