I would like to remove multiple web URLs from a string. If the string is the following:

this is a URL http://test.com and another one http://test.com/hi and this one http://www.test.com/

It should return

this is a URL and another one and this one

I tried using the following code:

gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", string)

But it returns this:

this is a URL

vdvaxel

3 Answers

This one will also work. Instead of (.*), we can use [^\\.]* (match up to the first dot of the domain) and \\S* to match to the end of the URL (until whitespace is found):

gsub("\\s?(f|ht)(tp)(s?)(://)([^\\.]*)[\\.|/](\\S*)", "", string)
# [1] "this is a URL and another one and this one"
Sandipan Dey

.* is greedy and matches to the end of the string, so everything after the first URL is removed. URLs normally don't contain whitespace, so you can use \\S (matches any non-whitespace character) instead of . (matches any character) to avoid the problem:

gsub(" ?(f|ht)(tp)s?(://)(\\S*)[./](\\S*)", "", string)
# [1] "this is a URL and another one and this one"
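
To see the difference side by side, here is a minimal self-contained sketch. The sample text and the variable name `string` come from the question; `greedy` and `fixed` are names introduced here for illustration only.

```r
# Sample input from the question
string <- "this is a URL http://test.com and another one http://test.com/hi and this one http://www.test.com/"

# Greedy .* runs to the end of the string, so everything after the first URL is lost
greedy <- gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", string)

# \\S* stops at the first whitespace, so each URL is removed separately
fixed <- gsub(" ?(f|ht)(tp)s?(://)(\\S*)[./](\\S*)", "", string)

print(greedy)
# [1] "this is a URL"
print(fixed)
# [1] "this is a URL and another one and this one"
```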
Psidom

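You can try the following regex. The lookahead (?=...) is not supported by R's default regex engine, so perl = TRUE is required. Because the pattern consumes the space after each URL rather than the one before it, a trailing space is left when the string ends with a URL:

gsub("https?:\\/\\/(.*?|\\/)(?=\\s|$)\\s?", "", string, perl = TRUE)
# [1] "this is a URL and another one and this one "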
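
If you want to reuse this pattern, one possible sketch is to wrap it in a small helper. `remove_urls` is a hypothetical name, not something from the original answer. It passes perl = TRUE because the lookahead (?=...) needs the PCRE engine, and it calls trimws() on the assumption that you do not want whitespace left over when a URL starts or ends the string.

```r
# Hypothetical helper, not part of the original answer
remove_urls <- function(x) {
  # perl = TRUE is needed for the lookahead (?=\\s|$)
  out <- gsub("https?:\\/\\/(.*?|\\/)(?=\\s|$)\\s?", "", x, perl = TRUE)
  # Drop leading/trailing whitespace left behind when a URL starts or ends the string
  trimws(out)
}

# Sample input from the question
string <- "this is a URL http://test.com and another one http://test.com/hi and this one http://www.test.com/"
print(remove_urls(string))
# [1] "this is a URL and another one and this one"
```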

m87