
I have a vector of strings, myStrings, in R that looks something like:

[1] download file from `http://example.com`
[2] this is the link to my website `another url`
[3] go to `another url` from more info.

where `another url` is a valid http url, but Stack Overflow will not let me insert more than one url, which is why I'm writing `another url` instead. I want to remove all the urls from myStrings to look like:

[1] download file from
[2] this is the link to my website
[3] go to from more info.

I've tried many functions in the stringr package but nothing works.

Jaap
Tavi

4 Answers


You can use gsub with a regular expression to match URLs.

Set up a vector:

x <- c(
    "download file from http://example.com", 
    "this is the link to my website http://example.com", 
    "go to http://example.com from more info.",
    "Another url ftp://www.example.com",
    "And https://www.example.net"
)

Remove all the URLs from each string:

gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
# [1] "download file from"             "this is the link to my website"
# [3] "go to from more info."          "Another url"                   
# [5] "And"   

Update: It would be best if you could post a few different URLs so we know what we're working with. But I think this regular expression will work for the URLs you mentioned in the comments:

" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)"

The above expression explained:

  • ` ?` an optional leading space
  • `(f|ht)` match "f" or "ht"
  • `tp` match "tp"
  • `(s?)` optionally match "s" if it's there
  • `(://)` match "://"
  • `(.*)` match every character (everything) up to
  • `[.|/]` a period, a pipe, or a forward slash (inside square brackets, `|` is a literal character, not alternation)
  • `(.*)` then everything after that

I'm not an expert with regular expressions, but I think I explained that correctly.

Note: url shorteners are no longer allowed in SO answers, so I was forced to remove a section while making my most recent edit. See edit history for that part.
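To sanity-check the updated pattern, here is a quick re-run on a couple of the sample strings from above (my own check; note the greedy `(.*)` at the end, which also swallows any text following the URL):

```r
x <- c(
    "download file from http://example.com",
    "go to http://example.com from more info."
)

# The final (.*) is greedy, so the match extends to the last "."
# or "/" in the string, taking any words after the URL with it
gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", x)
# [1] "download file from" "go to"
```

As one of the comments below observes, the greedy tail means "from more info." is removed along with the URL in the second string; the `[a-z]{2,6}` variant suggested in the comments is a tighter alternative.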

Rich Scriven
    regex "((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)" – cdev Aug 17 '14 at 18:58
  • Really? It removes the URLs on my end. What do your URLs look like exactly? You can put them in quotes in the original question. – Rich Scriven Aug 17 '14 at 19:02
  • Give this one a try `gsub("(f|ht)(tp)(s?)(://)(.*)[.][a-z]{2,6}", "", x)` – Rich Scriven Aug 17 '14 at 19:52
  • @RichardScriven I still get this result /N1kq0F26tG for urls such as this http://t.co/N1kq0F26tG – Tavi Aug 17 '14 at 19:54
  • Relevant: http://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url – durum Dec 22 '15 at 17:23
  • I tried the above regex in regex101 and it also removed the rest of the text after the url ends. – salvu Sep 17 '17 at 15:03

I've been working on a canned group of regular expressions for common tasks like this, which I've thrown into a package, qdapRegex, on GitHub; it will eventually go to CRAN. It can extract the pieces as well as sub them out. Feedback from anyone taking a look at the package is welcome.

Here it is:

library(devtools)
install_github("trinker/qdapRegex")
library(qdapRegex)

x <- c("download file from http://example.com", 
         "this is the link to my website http://example.com", 
         "go to http://example.com from more info.",
         "Another url ftp://www.example.com",
         "And https://www.example.net",
         "twitter type: t.co/N1kq0F26tG",
         "still another one https://t.co/N1kq0F26tG :-)")

rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"))

## [1] "download file from"             "this is the link to my website"
## [3] "go to from more info."          "Another url"                   
## [5] "And"                            "twitter type:"                 
## [7] "still another one :-)"         

rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"), extract=TRUE)

## [[1]]
## [1] "http://example.com"
## 
## [[2]]
## [1] "http://example.com"
## 
## [[3]]
## [1] "http://example.com"
## 
## [[4]]
## [1] "ftp://www.example.com"
## 
## [[5]]
## [1] "https://www.example.net"
## 
## [[6]]
## [1] "t.co/N1kq0F26tG"
## 
## [[7]]
## [1] "https://t.co/N1kq0F26tG"

Edit: I saw that twitter links were not removed. I will not be adding this to the regex specific to the rm_url function, but I have added it to the dictionary in qdapRegex. So there is no single function that removes both standard urls and twitter urls, but pastex (paste regular expressions) allows you to easily grab regexes from the dictionary and paste them together (using the pipe operator, |). Since all rm_XXX style functions work essentially the same way, you can pass the pastex output to the pattern argument of any rm_XXX function, or create your own function as I show below:

rm_twitter_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))
rm_twitter_url(x)
rm_twitter_url(x, extract=TRUE)
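For anyone who prefers not to install the package, the same paste-patterns-with-`|` idea can be sketched in base R. Both patterns below are my own rough approximations, not qdapRegex's actual dictionary entries:

```r
# Approximate stand-ins for the dictionary entries (my assumptions)
url_pat     <- "(f|ht)tps?://\\S+"                             # scheme-prefixed urls
twitter_pat <- "\\b[A-Za-z0-9]+\\.[A-Za-z]{2,3}/[A-Za-z0-9]+"  # bare t.co-style links
combined    <- paste(url_pat, twitter_pat, sep = "|")          # join with the pipe operator

x <- c("twitter type: t.co/N1kq0F26tG",
       "still another one https://t.co/N1kq0F26tG :-)")

# Remove the matches, then collapse the leftover whitespace
trimws(gsub("\\s+", " ", gsub(combined, "", x)))
# [1] "twitter type:"         "still another one :-)"
```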
Tyler Rinker
 str1 <- c("download file from http://example.com", "this is the link to my website https://www.google.com/ for more info")

 gsub('http\\S+\\s*',"", str1)
 #[1] "download file from "                         
 #[2] "this is the link to my website for more info"

 library(stringr)
 str_trim(gsub('http\\S+\\s*',"", str1)) #removes trailing/leading spaces
 #[1] "download file from"                          
 #[2] "this is the link to my website for more info"
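Since the question mentions stringr, the same removal can also be sketched with that package: `str_remove_all()` drops every match and `str_squish()` tidies the leftover whitespace (both exist in recent stringr versions):

```r
library(stringr)

str1 <- c("download file from http://example.com",
          "this is the link to my website https://www.google.com/ for more info")

# str_remove_all() removes all matches; str_squish() trims and
# collapses the whitespace left behind
str_squish(str_remove_all(str1, "http\\S+"))
# [1] "download file from"
# [2] "this is the link to my website for more info"
```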

Update

In order to match ftp, I would use the same idea as in @Richard Scriven's post:

  str1 <- c("download file from http://example.com", "this is the link to my website https://www.google.com/ for more info",
  "this link to ftp://www.example.org/community/mail/view.php?f=db/6463 gives more info")


  gsub('(f|ht)tp\\S+\\s*',"", str1)
  #[1] "download file from "                         
  #[2] "this is the link to my website for more info"
  #[3] "this link to gives more info"     
akrun

Some of the previous answers remove text beyond the end of the URL; adding "\b" (a word boundary) at the end of the pattern helps. The first pattern below also covers "sftp://" urls.

For regular urls:

gsub("(s?)(f|ht)tp(s?)://\\S+\\b", "", x)

For tiny urls:

gsub("[A-Za-z]{1,5}[.][A-Za-z]{2,3}/[A-Za-z0-9]+\\b", "", x)
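A quick check on a couple of made-up strings (my own examples) shows how the `\b` stops the match at the end of the URL, so trailing punctuation survives:

```r
# Regular urls: \b keeps the sentence-final period intact
gsub("(s?)(f|ht)tp(s?)://\\S+\\b", "", "go to http://example.com.")
# [1] "go to ."

# Tiny urls without a scheme
gsub("[A-Za-z]{1,5}[.][A-Za-z]{2,3}/[A-Za-z0-9]+\\b", "", "see t.co/N1kq0F26tG here")
# [1] "see  here"
```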
M--