First of all, this is my first post, so please forgive me if I missed something.
The problem is pretty simple: I want to extract all links from an HTML document. Of course I searched for a solution. I tried at least 30 of them, but none works well enough, and most don't work at all.
I ended up with this one (VB.Net):
Dim rx As New System.Text.RegularExpressions.Regex("<a\s+(?:[^>]*?\s+)?href=""([^""]*)""")
' Get regex matches
Dim mt As System.Text.RegularExpressions.MatchCollection = rx.Matches( _
    "sdfhjkl<a title=""datenkrake"" href=""http://www.google.de"">sdfghj</a>dfTHISISNOTALINK " & _
    "href=""narf.com""ghjkl<a href=""www.bing.de"" rel=""not really..."">bullshit</a>df<a href=""/"">local stuff</a>ghj" _
)
' Check regex matches
Diagnostics.Debug.WriteLine("Matches: " & mt.Count)
For i As Integer = 0 To mt.Count - 1
    Diagnostics.Debug.WriteLine(" " & mt(i).Value)
Next
Diagnostics.Debug.WriteLine("----------")
' Get URLs from the results
For i As Integer = 0 To mt.Count - 1
    Diagnostics.Debug.WriteLine(" " & mt(i).Value.Substring(mt(i).Value.TrimEnd("""").LastIndexOf("""")).Trim(""""))
Next
The debug output:
Matches: 3
<a title="datenkrake" href="http://www.google.de"
<a href="www.bing.de"
<a href="/"
----------
http://www.google.de
www.bing.de
/
The part below the dashed line is exactly what I want. But can I get this output directly from the regex, without all the Trim and LastIndexOf work?
I'm pretty sure I will never really understand regex syntax... but performance is important in this case.
Thanks in advance!
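For reference, a minimal sketch of one way to avoid the string trimming, assuming the pattern above is kept unchanged: the parentheses around `[^""]*` already form capture group 1, so `Match.Groups(1).Value` should contain only the URL. The module name `LinkExtractDemo` and the shortened sample input are made up for illustration, and `RegexOptions.Compiled` is an optional assumption that may or may not help performance in a given scenario.

```vb
Imports System.Text.RegularExpressions

Module LinkExtractDemo
    Sub Main()
        ' Same pattern as in the question. RegexOptions.Compiled is an assumption:
        ' it trades start-up cost for faster matching when the regex is reused often.
        Dim rx As New Regex("<a\s+(?:[^>]*?\s+)?href=""([^""]*)""", RegexOptions.Compiled)

        ' Shortened, hypothetical sample input (not the full string from the question).
        Dim html As String = "x<a title=""t"" href=""http://www.google.de"">a</a>" & _
                             "href=""narf.com""<a href=""/"">local</a>"

        For Each m As Match In rx.Matches(html)
            ' Groups(0) is the whole match; Groups(1) is the first parenthesised group,
            ' which here is the text between the quotes of the href attribute.
            Console.WriteLine(m.Groups(1).Value)
        Next
    End Sub
End Module
```

Reading the group directly avoids the extra Substring, TrimEnd and LastIndexOf calls on every match. Whether that makes a measurable difference would need to be checked against real input.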