Replace URL with domain (R)

Question

I want to replace a URL in a string ("Hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example") by its domain ("Hello world stackoverflow.com").

So far I was able to identify and replace the URL by some constant value but not by the URL's domain:

x <- "Hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example"

gsub("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "URL", x)

Any help is highly appreciated.

rvalvik · Answer 1 · 2013-04-14T20:40:08.557

Depending on how important it is to evaluate the url itself you could probably get away with something like:

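gsub("https?://([^/\\s]+)\\S*", "\\1", x, perl = TRUE)

This matches http:// (with an optional s), captures one or more characters that are neither whitespace nor / as back-reference group 1 (the domain), and then greedily consumes zero or more remaining non-whitespace characters. The entire match is then replaced by the captured group, so the scheme and the path are both dropped. perl = TRUE is needed because R's default regex engine does not treat \s inside a bracket expression as "whitespace"; it reads it as a literal backslash and the letter s.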

Note: This assumes the url does not contain any spaces.

score 1 · Answer 2 · answered Apr 14 '13 at 20:32

You need to use a back-reference.

Let me preface this by saying that I don't know R but I assume the syntax for back-references is \N where N is the match group.

So if you replace the pattern

https?://([^/\s]++)\S*+

By the string

\1

You should end up replacing the matched pattern with the capture group.

I do not know what the escaping conventions are but you may need to escape the backslash with another backslash.

The pattern broken down is

https? match "http" followed by an optional "s"
:// match the literal "://"
([^/\s]++) match and grab everything until the next slash or space (the domain)
\S*+ match the rest of the URL - until the next whitespace
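Applied in R, the pattern above could look like the following sketch. It assumes perl = TRUE, since possessive quantifiers (++ and *+) are a PCRE feature, and it doubles every backslash because the pattern is written as an R string literal. The variable name domain_only is only illustrative.

```r
x <- "Hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example"

# Group 1 captures the host; the rest of the URL is matched and discarded.
# perl = TRUE is assumed so that the possessive quantifiers are accepted.
domain_only <- gsub("https?://([^/\\s]++)\\S*+", "\\1", x, perl = TRUE)

domain_only
## [1] "Hello world stackoverflow.com"
```

The text outside the URL is untouched because gsub only rewrites the part of the string that the pattern matched.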

score 0 · Answer 3 · answered Apr 14 '13 at 20:33

You can use grep (GNU grep, for the -P option) to scan a string and extract everything between http:// and the next /:

grep -Po 'http://\K.*?(?=/)'

Check out http://rfunction.com/archives/1481 and a regex guide here: http://www.regular-expressions.info/
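A rough R counterpart of the grep one-liner, as a sketch rather than a drop-in answer: regexpr with perl = TRUE accepts the same \K and lookahead syntax, and regmatches pulls out the matched text. Note that this extracts the domain only; it does not rewrite the original string. The variable name domain is an assumption made for the example.

```r
x <- "Hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example"

# \\K discards "http://" from the reported match; (?=/) stops before the next slash.
m <- regexpr("http://\\K.*?(?=/)", x, perl = TRUE)
domain <- regmatches(x, m)

domain
## [1] "stackoverflow.com"
```

To get the full sentence with the URL shortened, the extracted domain would still have to be substituted back into x, for example with a gsub call that uses a capture group.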

KLDavenport

score 0 · Accepted Answer · answered May 07 '14 at 22:54

The problem here (compared to prior questions on Stack Overflow) is that the non-URL part of the string should remain while the URL is shortened to its domain.

Based on the post mentioned in my question, I now use the following solution:

x <- "Hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example"

y.1 <- gsub("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "", x)
y.2 <- gsub("^www\\.", "", sapply(strsplit(x, "//|/"), "[", 2))

z <- paste( y.1, y.2, sep="")

z

It is not the most elegant solution, but it works.

jpmarindiaz · Answer 5 · 2014-08-18T13:07:31.657

    library(httr)
    txt <- "hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible"
    l <- lapply(unlist(strsplit(txt," ",fixed=TRUE)),function(w){
           hostname <- parse_url(w)$hostname
           if(is.null(hostname) ) hostname <- w
           hostname
          })
    paste(l,collapse=" ")
    ## hello world stackoverflow.com

edited Aug 18 '14 at 13:07

answered Aug 18 '14 at 09:59

This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post - you can always comment on your own posts, and once you have sufficient [reputation](http://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](http://stackoverflow.com/help/privileges/comment). – Canavar Aug 18 '14 at 10:20
@Canavar it does now. I had forgotten to add the input string in txt – jpmarindiaz Aug 18 '14 at 13:08
