
I am trying to remove URLs that may or may not start with http/https from a large text file, which I have saved as `urldoc` in R. A URL may look like tinyurl.com/ydyzzlkk or aclj.us/2y6dQKw or pic.twitter.com/ZH08wej40K. Basically, I want to remove everything from the space before a '/' up to the space after it. I have tried many patterns and searched in many places, but couldn't complete the task. It would help me a lot if you could give some input.

This is the last statement I tried for the above problem before getting stuck: `urldoc = gsub("?[a-z]+\..\/.[\s]$","", urldoc)`

The input would be: A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk

The output I am expecting is: A disgrace to his profession. In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. nothing like the admin. proposal:

Thanks.

srk3124
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Show what you tried and say exactly where you are getting stuck. – MrFlick Oct 30 '18 at 19:08
  • Try `urldoc = gsub("\\s*[^ /]+/[^ /]+","", urldoc)`, see [regex demo](https://regex101.com/r/4pqRly/1). – Wiktor Stribiżew Oct 30 '18 at 19:17
  • Thanks for the input. I added the input, output and the statement I tried. Please check. – srk3124 Oct 30 '18 at 19:17

3 Answers


According to your specs, you may use the following regex:

\s*[^ /]+/[^ /]+

See the regex demo.

Details

  • `\s*` - 0 or more whitespace chars
  • `[^ /]+` (or `[^[:space:]/]+`) - any 1 or more chars other than a space (or any whitespace) and `/`
  • `/` - a slash
  • `[^ /]+` (or `[^[:space:]/]+`) - any 1 or more chars other than a space (or any whitespace) and `/`.

R demo:

urldoc = gsub("\\s*[^ /]+/[^ /]+","", urldoc)

If you want to account for any whitespace, replace the literal space with `[:space:]`:

urldoc = gsub("\\s*[^[:space:]/]+/[^[:space:]/]+","", urldoc)
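
As a quick check (a sketch, with `urldoc` holding just the single sample string from the question), this should reproduce the expected output:

urldoc <- "A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk"
gsub("\\s*[^ /]+/[^ /]+", "", urldoc)
# [1] "A disgrace to his profession. In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. nothing like the admin. proposal:"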
Wiktor Stribiżew
  • Thank you. It worked. But it left only one partial url, "https://". Everything else has been removed. Can you suggest any regex, to remove even that? – srk3124 Oct 30 '18 at 19:33
  • @srk3124 `urldoc = gsub("\\s*(?:https?://)?[^ /]+/[^ /]+","", urldoc)`? Or `urldoc = gsub("\\s*(?:[^ /]+/[^ /]+|https?://)","", urldoc)`? – Wiktor Stribiżew Oct 30 '18 at 19:36
  • Thank you. The problem is solved. I am a beginner in R and I am trying hard to catch up. Can you suggest me any book to get a good knowledge on R? – srk3124 Oct 30 '18 at 19:57
  • @srk3124 I think you should find something for your specific needs yourself, I can only recommend following the R tag here on SO. See [this R book resource page](https://www.r-project.org/doc/bib/R-books.html), maybe it will help you find something. – Wiktor Stribiżew Oct 30 '18 at 20:29
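
As a quick sketch of the first variant suggested in the comments above (the sample string here is made up, just to show a scheme-prefixed URL being removed in full):

x <- "read this https://tinyurl.com/ydyzzlkk and this aclj.us/2y6dQKw today"
gsub("\\s*(?:https?://)?[^ /]+/[^ /]+", "", x)
# [1] "read this and this today"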

It's already been answered, but here is an alternative if you've not come across stringi before.

# stringi: a comprehensive package for string manipulation
library(stringi)

# text and regex
text <- "A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk" 
pattern <- "(?:\\s)[^\\s\\.]*\\.[^\\s]+"

# see what is captured
stringi::stri_extract_all_regex(text, pattern)

# remove (replace with "")
stringi::stri_replace_all_regex(text, pattern, "")
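
On the sample text, the extraction should surface exactly the three URL-like tokens, and the replacement should return the sentence the question asks for (expected output, roughly):

# extraction (roughly): " pic.twitter.com/ZH08wej40K"  " goo.gl/YmNELW"  " tinyurl.com/ydyzzlkk"
# replacement: "A disgrace to his profession. In a major victory for religious liberty, the Admin.
#               has eviscerated institution continuing this path. nothing like the admin. proposal:"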
Jonny Phelps

This might work:

text <- " http:/thisisanurl.wde , thisaint , nope , uihfs/yay"
words <- strsplit(text, " ")[[1]]
isurl <- sapply(words, function(x) grepl("/",x))
result <- paste0(words[!isurl], collapse = " ")
result
[1] " , thisaint , nope ,"
gaut