1

I want to replace a URL in a string ("Hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example") by its domain ("Hello world stackoverflow.com").

So far I was able to identify and replace the URL by some constant value but not by the URL's domain:

x <- "Hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example"

gsub("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "URL", x) 

Any help is highly appreciated.

majom

5 Answers

2

Depending on how important it is to evaluate the url itself you could probably get away with something like:

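gsub("https?://([^/\\s]+)\\S*", "\\1", x, perl = TRUE)

This matches http:// with an optional s, then captures one or more characters that are neither whitespace nor / as back-reference group 1 (the domain), and then greedily consumes zero or more non-whitespace characters (the rest of the URL). The entire match is then replaced by the captured group. perl = TRUE is used because R's default regex engine does not treat \s as a whitespace class inside a bracket expression.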

Note: This assumes the url does not contain any spaces.
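The same back-reference idea can be tried on several strings at once, since gsub is vectorised. This is only a sketch: the input strings and the name strip_to_domain are made up for illustration, and perl = TRUE is assumed so that \s works inside the bracket expression.

```r
# Hypothetical helper: replace every URL in a character vector by its host part.
strip_to_domain <- function(text) {
  # Group 1 captures the host; \\S* consumes the rest of the URL, if any.
  gsub("https?://([^/\\s]+)\\S*", "\\1", text, perl = TRUE)
}

# Made-up inputs: an https URL with a path, a URL without a path, and no URL at all.
inputs <- c("see https://www.r-project.org/about.html now",
            "bare http://example.com",
            "no link here")

out <- strip_to_domain(inputs)
print(out)
```

Note that a leading "www." is kept, and strings without a URL are returned unchanged.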

rvalvik
1

You need to use a back-reference.

Let me preface this by saying that I don't know R but I assume the syntax for back-references is \N where N is the match group.

So if you replace the pattern

https?://([^/\s]++)\S*+

By the string

\1

You should end up replacing the matched pattern with the capture group.

I do not know what the escaping conventions are but you may need to escape the backslash with another backslash.

The pattern broken down is

  • https? match "http" followed by an optional "s"
  • :// match the literal "://"
  • ([^/\s]++) match and grab everything until the next slash or space (the domain)
  • \S*+ match the rest of the URL - until the next whitespace
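In R, the pattern and replacement above might be written as follows. This is a sketch under two assumptions: the PCRE engine is selected with perl = TRUE (the possessive quantifiers ++ and *+ are a PCRE feature and are not expected to work with R's default engine), and every backslash is doubled inside the R string literal. x is the example string from the question.

```r
x <- "Hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example"

# perl = TRUE selects PCRE, which understands the possessive quantifiers.
# "\\s", "\\S" and "\\1" are doubled because R strings consume one backslash.
result <- gsub("https?://([^/\\s]++)\\S*+", "\\1", x, perl = TRUE)
print(result)
```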
Boris the Spider
0

You can use grep to scan a string and extract everything between http:// and the next /:

grep -Po 'http://\K.*?(?=/)'

Check out http://rfunction.com/archives/1481 and a regex guide here: http://www.regular-expressions.info/
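A rough R counterpart of that grep call could use regexpr and regmatches with perl = TRUE. This is only a sketch: it extracts the domain rather than replacing the URL inside the string, it assumes a / follows the domain, and the optional s after http is an addition not present in the grep command.

```r
x <- "Hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example"

# \\K drops the scheme from the reported match; (?=/) stops before the next slash.
m <- regexpr("https?://\\K.*?(?=/)", x, perl = TRUE)

# regmatches() returns the matched substring (here: only the domain).
domain <- regmatches(x, m)
print(domain)
```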

KLDavenport
0

The problem here (compared to prior questions on Stack Overflow) is that the non-URL part of the string should remain while the URL is shortened to its domain.

Based on the post mentioned in my question, I now use the following solution:

x <- "Hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example"

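# Remove the whole URL from the string
y.1 <- gsub("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "", x)
# Split on "//" or "/", take the second piece (the host), and drop a leading "www."
y.2 <- gsub("^www\\.", "", sapply(strsplit(x, "//|/"), "[", 2))

# Append the host to the remaining text
z <- paste(y.1, y.2, sep = "")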

z

It is not the most elegant solution, but it works.

majom
0
    library(httr)
    txt <- "hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible"
    l <- lapply(unlist(strsplit(txt," ",fixed=TRUE)),function(w){
           hostname <- parse_url(w)$hostname
           if(is.null(hostname) ) hostname <- w
           hostname
          })
    paste(l,collapse=" ")
    ## hello world stackoverflow.com
jpmarindiaz
  • This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post - you can always comment on your own posts, and once you have sufficient [reputation](http://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](http://stackoverflow.com/help/privileges/comment). – Canavar Aug 18 '14 at 10:20
  • @Canavar it does now. I had forgotten to add the input string in txt – jpmarindiaz Aug 18 '14 at 13:08