First of all, this is my first post, so please forgive me if I missed something.

The problem is pretty simple: I want to extract all links from an HTML document. Of course I searched for a solution. I tried at least 30 of them, but none works well enough, and most don't work at all.

I ended up with this one (VB.Net):

    Dim rx As New System.Text.RegularExpressions.Regex("<a\s+(?:[^>]*?\s+)?href=""([^""]*)""")

    ' Get regex matches
    Dim mt As System.Text.RegularExpressions.MatchCollection = rx.Matches( _
      "sdfhjkl<a title=""datenkrake"" href=""http://www.google.de"">sdfghj</a>dfTHISISNOTALINK " & _
      "href=""narf.com""ghjkl<a href=""www.bing.de"" rel=""not really..."">bullshit</a>df<a href=""/"">local stuff</a>ghj" _
    )

    ' Check regex matches
    Diagnostics.Debug.WriteLine("Matches: " & mt.Count)
    For i As Integer = 0 To mt.Count - 1
        Diagnostics.Debug.WriteLine("  " & mt(i).Value)
    Next

    Diagnostics.Debug.WriteLine("----------")

    ' Get URLs from the results
    For i As Integer = 0 To mt.Count - 1
        Diagnostics.Debug.WriteLine("  " & mt(i).Value.Substring(mt(i).Value.TrimEnd("""").LastIndexOf("""")).Trim(""""))
    Next

The debug output:

    Matches: 3
      <a title="datenkrake" href="http://www.google.de"
      <a href="www.bing.de"
      <a href="/"
    ----------
      http://www.google.de
      www.bing.de
      /

The part below the dashed line is exactly what I want. But isn't this output possible without all the Trim and LastIndexOf stuff?

I'm pretty sure I will never understand this smiley g@ngbang (aka regex)... But for this case performance is important.

Thanks in advance!

tightDev
  • You seem to have missed [this *`Html Agility Pack`* solution](https://stackoverflow.com/questions/2248411/get-all-links-on-html-page) – Wiktor Stribiżew Jul 03 '18 at 16:50
  • Using Html Agility Pack is one of the options, are you required to use regex? – CruleD Jul 03 '18 at 16:54
  • What is the real HTML? Could you point to a page where it appears? It might be possible to convert your data into XML. At least that could work. – JohnyL Jul 03 '18 at 18:34
  • I tried the regex from @GRUNGER mentioned in the first comment (`<(a|link).*?href=(\"|')(.+?)(\"|').*?>`) and it works like the one I used before, except that it selects the whole opening tag instead of only the part up to the href. I had never heard of Html Agility Pack before, and I'm not sure how to use it right now (I've never used NuGet either). But I'll have a look at it, thanks :) – tightDev Jul 03 '18 at 22:12
  • The real HTML... It should work with any web site, as well as possible, even with malformed (X)HTML. I tried my code on golem.de as well (one bug found: an empty href attribute wasn't handled by my code). – tightDev Jul 03 '18 at 22:15
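
On the question of avoiding the Trim/LastIndexOf step: the pattern in the question already has a capture group, `([^""]*)`, around the href value, so one possible approach is to read `Groups(1)` of each match instead of cutting up `Value`. The following is a minimal sketch under that assumption, not a general HTML parser. The module name `HrefDemo` and the shortened input string are made up for illustration. It also returns an empty string for an empty `href=""` rather than throwing.

```vbnet
Imports System.Text.RegularExpressions

Module HrefDemo
    Sub Main()
        ' Same pattern as in the question; group 1 is the text between the quotes after href=
        Dim rx As New Regex("<a\s+(?:[^>]*?\s+)?href=""([^""]*)""")

        ' Shortened, hypothetical test input
        Dim html As String = "x<a title=""t"" href=""http://www.google.de"">a</a>" & _
                             "<a href=""/"">b</a>"

        For Each m As Match In rx.Matches(html)
            ' Groups(0) is the whole match; Groups(1) is the first capture group
            Console.WriteLine(m.Groups(1).Value)
        Next
    End Sub
End Module
```

With this input the loop should print `http://www.google.de` and `/`. Like the original pattern, this only handles double-quoted href values.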
