I have the following HTML string:

<a href="/tothepage" title="the page">The Link</a>.  

How can I easily extract the title from this HTML snippet? A regex or other VB.NET solution is preferred, but C# is OK.

I want 'the page' not the link text: I want the value of the title attribute.

I have HtmlAgilityPack installed, if it's easy to do with that.

Andrew Morton
MiscellaneousUser

2 Answers

As you have the HtmlAgilityPack already, you can extract the "title" attribute like this:

Option Infer On
Option Strict On

Imports HtmlAgilityPack

Module Module1

    Sub Main()
        Dim a = "<a href=""/tothepage"" title=""the page"">The Link</a>."
        Dim doc As New HtmlDocument()
        doc.LoadHtml(a)
        Dim node = doc.DocumentNode.SelectSingleNode("/a")
        Dim title = node?.Attributes("title")?.Value

        Console.WriteLine(title) ' outputs "the page"

        Console.ReadLine()

    End Sub

End Module

In practice you won't need that many lines of code; the listing above is a complete working example.

The ?. (null-conditional) operators stop it from throwing an error if node is Nothing (in this case, if there is no "<a>" element) or if there is no "title" attribute; in either case title is simply Nothing.
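To see the ?. behaviour in isolation, here is a minimal sketch that does not need HtmlAgilityPack. FakeNode and TitleOf are hypothetical names invented for illustration; they only stand in for a node that may or may not exist.

```vbnet
Option Infer On
Option Strict On

Imports System

Module NullConditionalSketch

    ' Hypothetical stand-in for a node that carries a title value.
    Class FakeNode
        Public Property Title As String
    End Class

    ' Returns the title, or Nothing when the node itself is Nothing.
    Function TitleOf(node As FakeNode) As String
        Return node?.Title
    End Function

    Sub Main()
        Dim missing As FakeNode = Nothing
        Console.WriteLine(TitleOf(missing) Is Nothing) ' outputs "True", no exception
        Console.WriteLine(TitleOf(New FakeNode With {.Title = "the page"})) ' outputs "the page"
    End Sub

End Module
```

Without the ?., the first call would throw a NullReferenceException instead of returning Nothing.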

Andrew Morton
  • It never transpired to me that I needed to treat the hyperlink string as a document. Thx, I got what I needed. – MiscellaneousUser Nov 23 '16 at 20:25
  • @MiscellaneousUser While you *can* parse small amounts of some HTML with regexes, it is usually not a good idea - reasons for that are given in the quite amusing post [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/a/1732454/1115360). – Andrew Morton Nov 23 '16 at 20:29
With a regular expression, the capture group ([^"]*) will contain the value of the title attribute:

title="([^"]*)"

C#

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string originalString = "<a href=\"/tothepage\" title=\"the page\">The Link</a>.";
        // Group 1 captures everything between the quotes of the title attribute.
        Regex rgx = new Regex("title=\"([^\"]*)\"", RegexOptions.IgnoreCase);
        Match match = rgx.Match(originalString);
        if (match.Success)
        {
            Console.WriteLine(match.Groups[1].Value); // outputs "the page"
        }
        Console.ReadLine();
    }
}
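Since the question prefers VB.NET, the same regular expression could be used roughly as follows. This is only a sketch: ExtractTitle is a hypothetical helper name, and it assumes the attribute value is wrapped in double quotes. It would also match an attribute whose name merely ends in "title" (such as data-title), so it suits small, known snippets rather than arbitrary HTML.

```vbnet
Option Infer On
Option Strict On

Imports System
Imports System.Text.RegularExpressions

Module RegexTitleSketch

    ' Hypothetical helper: returns the value of the first title="..." attribute,
    ' or Nothing when the pattern does not match.
    Function ExtractTitle(html As String) As String
        Dim m = Regex.Match(html, "title=""([^""]*)""", RegexOptions.IgnoreCase)
        Return If(m.Success, m.Groups(1).Value, Nothing)
    End Function

    Sub Main()
        Dim a = "<a href=""/tothepage"" title=""the page"">The Link</a>."
        Console.WriteLine(ExtractTitle(a)) ' outputs "the page"
    End Sub

End Module
```

Checking m.Success before reading the group avoids an exception when the snippet has no title attribute.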
blaze_125