
I was able to extract the href values of the anchors in an HTML string. What I want to do now is extract each href value and replace it with a new GUID. I need to return both the modified HTML string and a list of the extracted href values with their corresponding GUIDs.

Thanks in advance.

My existing code looks like this:

Dim sPattern As String = "<a[^>]*href\s*=\s*((\""(?<URL>[^\""]*)\"")|(\'(?<URL>[^\']*)\')|(?<URL>[^\s]* ))"

Dim matches As MatchCollection = Regex.Matches(html, sPattern, RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace)

If Not IsNothing(matches) AndAlso matches.Count > 0 Then
    Dim urls As List(Of String) = New List(Of String)

    For Each m As Match In matches
      urls.Add(m.Groups("URL").Value)
    Next
End If

Sample HTML string:

<html><body><a title="http://www.google.com" href="http://www.google.com">http://www.google.com</a><br /><a href="http://www.yahoo.com">http://www.yahoo.com</a><br /><a title="http://www.apple.com" href="http://www.apple.com">Apple</a></body></html>
user557670

1 Answer

You could do something like this:

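' Same pattern as in the question; the named group URL captures the href value.
Dim pattern As String = "<a[^>]*href\s*=\s*((\""(?<URL>[^\""]*)\"")|(\'(?<URL>[^\']*)\')|(?<URL>[^\s]* ))"
' Maps each generated GUID to the URL it replaced.
Dim urls As New Dictionary(Of Guid, String)
' Called once per match; its return value replaces the matched text.
Dim evaluator As MatchEvaluator = Function(m As Match)
    Dim g As Guid = Guid.NewGuid()
    Dim url As String = m.Groups("URL").Value
    urls.Add(g, url)
    Return m.Value.Replace(url, g.ToString())
End Function

Dim newHtml As String = Regex.Replace(html, pattern, evaluator)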

In the end, newHtml has the following value:

<html><body><a title="329eb2c4-ee51-49fa-a8cd-2de319c3dbad" href="329eb2c4-ee51-49fa-a8cd-2de319c3dbad">http://www.google.com</a><br /><a href="77268e2d-87c4-443c-980c-9188e22f8496">http://www.yahoo.com</a><br /><a title="2941f77a-a143-4990-8ad7-3ef56972a8d4" href="2941f77a-a143-4990-8ad7-3ef56972a8d4">Apple</a></body></html>

And the urls dictionary contains the following entries:

329eb2c4-ee51-49fa-a8cd-2de319c3dbad: http://www.google.com
77268e2d-87c4-443c-980c-9188e22f8496: http://www.yahoo.com
2941f77a-a143-4990-8ad7-3ef56972a8d4: http://www.apple.com
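
Note that in the output above the title attributes were rewritten as well, because String.Replace substitutes every occurrence of the URL inside the matched text. If only the href value should change, one possible variation is to splice the GUID in at the position of the captured group. The following is a minimal sketch, not a definitive implementation: the module name HrefReplacer, the function ReplaceHrefs, and the slightly adjusted pattern (case-insensitive, with an unquoted branch that does not require a trailing space) are assumptions made for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module HrefReplacer
    ' Assumed variant of the pattern: quoted values are captured without their quotes,
    ' and an unquoted value runs up to the next whitespace or ">".
    Private Const Pattern As String = "<a\b[^>]*?\shref\s*=\s*(?:""(?<URL>[^""]*)""|'(?<URL>[^']*)'|(?<URL>[^\s>]+))"

    ' Replaces only the href value of each anchor and records GUID -> original URL in urls.
    Function ReplaceHrefs(html As String, urls As Dictionary(Of Guid, String)) As String
        Dim evaluator As MatchEvaluator =
            Function(m As Match)
                Dim urlGroup As Group = m.Groups("URL")
                Dim g As Guid = Guid.NewGuid()
                urls.Add(g, urlGroup.Value)
                ' Group.Index is relative to the whole input, so convert it to an
                ' offset inside the matched text before splicing.
                Dim offset As Integer = urlGroup.Index - m.Index
                Return m.Value.Substring(0, offset) & g.ToString() &
                       m.Value.Substring(offset + urlGroup.Length)
            End Function

        Return Regex.Replace(html, Pattern, evaluator, RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        Dim urls As New Dictionary(Of Guid, String)
        Dim html As String = "<a title=""http://www.google.com"" href=""http://www.google.com"">Google</a>"
        Console.WriteLine(ReplaceHrefs(html, urls))
        For Each entry As KeyValuePair(Of Guid, String) In urls
            Console.WriteLine(entry.Key.ToString() & ": " & entry.Value)
        Next
    End Sub
End Module
```

With this variation the title attribute keeps its original URL, and the dictionary still receives one entry per anchor.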

By the way, note that regular expressions are not the best tool for parsing HTML. A parser such as HTML Agility Pack would be more appropriate.
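
For comparison, here is a rough sketch of the same replacement done with HTML Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced; the module name AgilityPackHrefReplacer and the function ReplaceHrefs are hypothetical names chosen for this example.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module AgilityPackHrefReplacer
    ' Replaces the href of every anchor and records GUID -> original URL in urls.
    Function ReplaceHrefs(html As String, urls As Dictionary(Of Guid, String)) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim anchors As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
        If anchors IsNot Nothing Then
            For Each anchor As HtmlNode In anchors
                Dim g As Guid = Guid.NewGuid()
                urls.Add(g, anchor.GetAttributeValue("href", String.Empty))
                anchor.SetAttributeValue("href", g.ToString())
            Next
        End If

        ' Serialize the modified document back to a string.
        Return doc.DocumentNode.OuterHtml
    End Function
End Module
```

Because the document is parsed and then serialized again, the returned string may not be character-for-character identical to the input outside the href values, but only real href attributes of anchor elements are touched.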

Thomas Levesque