I want a regular expression for VB.NET that removes all hyperlinks from a string, including the protocol (http or https), subdomains, the full document name, and query-string parameters.

Here's the string I'm working with in which all links need to be removed:

Dim description As String

description = "Deep purples blanket / wrap. It is gorgeous" & _
"in newborn photography. " & _
"layer" & _
"beneath the baby.....the possibilities are endless!" & _
"You will get this prop! " & _
"Gorgeous images using Lavender as a basket filler " & _
"Photo by Benbrook, TX" & _
"Imaging, Ontario" & _
"http://www.photo.com?t=3" & _
" www.photo.com" & _
" http://photo.com" & _
" https://photo.com" & _
" http://www.photo.nl?t=1&url=5" & _
"Photography Cameron, NC" & _
"Thank you so much ladies!!" & _
"The flower halos has beautiful items!" & _
"http://www.enchanting.etsy.com" & _
"LIKE me on FACEBOOK for coupon codes, and to see my full product line!" & _
"http://www.facebook.com/byme"

What I have now:

description = Regex.Replace(description, _
                    "((http|https|ftp)\://[a-zA-Z0-9\-\.]+(\.[a-zA-Z]{2,3})?(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*)", "")

It removes most links, but not links without a protocol, such as www.example.com.

How do I alter my expression to include these links?

Adam

1 Answer

You can split the string with Split() and then check each element. If it can be parsed as an absolute Uri, discard it from the array, and then re-build the string:

Dim urlStr As String
Dim resultUri As Uri
urlStr = "Beautiful images using Lavender, see https://www.foo.com" & vbCrLf & _
    "Plenty of links http://www.foo.com/page.html?t=7 Oshawa, Ontario" & vbCrLf & _
    "http://www.example.com" & vbCrLf & "Photography, NC"

' Requires Imports System.Linq for Select.
Dim resNoURL = String.Join(" ", urlStr.Split().Select(Function(m As String)
                      If Uri.TryCreate(m, UriKind.Absolute, resultUri) = False Then
                          Return m
                      End If
                      ' URL tokens map to Nothing, which String.Join renders as an empty string.
                      Return Nothing
                      End Function).ToList())

Alternatively, check if m starts with http:// or https://. You can even use a regex check:

Dim rx As Regex = New Regex("(?i)^(?:https?|ftps?)://")

And then in the callback:

If rx.IsMatch(m) = False Then
    Return m
End If
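
Put together, the token-filtering variant might look like the sketch below. The module name `UrlFilterDemo`, the helper `RemoveUrlTokens`, and the extra `www\.` alternative in the pattern are assumptions added for illustration; they are not part of the snippets above. It uses `Where` to drop URL tokens outright, so no empty placeholders are left behind. Splitting and rejoining also collapses line breaks and repeated whitespace into single spaces.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module UrlFilterDemo
    ' Matches tokens that start with a protocol or, as an added assumption, with "www."
    Private ReadOnly UrlStart As New Regex("^(?:(?:https?|ftps?)://|www\.)", RegexOptions.IgnoreCase)

    ' Hypothetical helper: splits on whitespace and drops every token that looks like a URL.
    Function RemoveUrlTokens(source As String) As String
        ' A null separator array makes Split break on any whitespace.
        Dim tokens = source.Split(DirectCast(Nothing, Char()), StringSplitOptions.RemoveEmptyEntries)
        Return String.Join(" ", tokens.Where(Function(t) Not UrlStart.IsMatch(t)))
    End Function

    Sub Main()
        Dim sample = "See https://www.foo.com and www.photo.com for details"
        Console.WriteLine(RemoveUrlTokens(sample)) ' prints "See and for details"
    End Sub
End Module
```

Because the filter works on whole tokens, punctuation attached to a URL (for example a trailing comma) is removed together with it, and a URL glued to neighbouring text without whitespace is only caught if the token starts with the URL.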

UPDATE

Here is sample code that removes the URLs from the string:

Dim urlStr As String
urlStr = "YOUR STRING"
Dim MyRegex As Regex = New Regex("(?:(http|https|ftp)://|www\.)[a-zA-Z0-9.-]+(\.[a-zA-Z]{2,3})?(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9._?,'/\\+&%$#=~-])*")
Console.WriteLine(MyRegex.Replace(urlStr, ""))
Wiktor Stribiżew
  • According to this answer, it should not be necessary to split the string first: http://stackoverflow.com/a/6811780/769449 I've updated my question with the regex I'm using now, but I'm still missing the case where the link has no protocol in it... can you help? – Adam Sep 07 '15 at 15:45
  • Do you want to also detect links like `www.something.com` with the regex you have? Try `((?:(http|https|ftp)://|www\.)[a-zA-Z0-9.-]+(\.[a-zA-Z]{2,3})?(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9._?,'/\\+&%$#=~-])*)`. It is the same regex, I just added www and removed unnecessary escaping. – Wiktor Stribiżew Sep 07 '15 at 16:01
  • Yes, I want to detect those links too, but your regular expression does not replace the links in my sample string, which I just added... – Adam Sep 07 '15 at 16:13
  • Look at [this demo, no URLs remain after the replacement](http://ideone.com/kHC58y). – Wiktor Stribiżew Sep 07 '15 at 17:23