VB.net - How to extract content of HTML using regex?

Question

<div class="gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long" dir="ltr" style="word-break:break-all;">pastebin.com/N8VKGxR9</div>

If I have this, how can I extract only the pastebin url portion in VB.net using regex? I've downloaded the entire webpage using WC.DownloadString().

[I think this is still the most upvoted post on SO](http://stackoverflow.com/q/1732348/1070452) — Ňɏssa Pøngjǣrdenlarp, Dec 23 '16 at 16:23

SouXin · Accepted Answer · 2016-12-23T16:45:20.360

0

 ' Requires: Imports System.Text.RegularExpressions
 Dim text As String = "<div class=""gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long"" dir=""ltr"" style=""word-break:break-all;"">pastebin.com/N8VKGxR9</div>"
 ' [^>]* stays inside the opening tag; (.*?) is lazy and stops at the first </div>
 Dim pattern As String = "<div[^>]*gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long[^>]*>(.*?)</div>"
 Dim r As New Regex(pattern)
 Dim m As Match = r.Match(text)
 Dim g As Group = m.Groups(1)

g.Value will then contain pastebin.com/N8VKGxR9

BTW: the post linked in the comments is about using regex to match the tags themselves, not the text between the tags. So this is quite possible.

Edited to keep only divs with these classes
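
For the follow-up in the comments (several such divs inside a larger page), here is a minimal sketch that is not part of the original answer. It assumes the class attribute appears exactly as in the question and that the wanted divs contain no nested divs. The module name, the variable names, and the sample page are made up for illustration. Regex.Matches returns every match rather than only the first one:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MultiMatchSketch

    Sub Main()
        ' Hypothetical sample page: one unrelated tag plus two wanted divs.
        Dim html As String =
            "<p>other</p>" &
            "<div class=""gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long"" dir=""ltr"">pastebin.com/N8VKGxR9</div>" &
            "<div class=""gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long"" dir=""ltr"">pastebin.com/ABC</div>"

        ' [^>]* cannot leave the opening tag, and the lazy (.*?) stops at the
        ' first closing </div>, so each div produces its own match.
        Dim pattern As String =
            "<div[^>]*class=""gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long""[^>]*>(.*?)</div>"

        ' Singleline lets "." cross line breaks in case the text is wrapped.
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.Singleline)
            Console.WriteLine(m.Groups(1).Value)
        Next
    End Sub

End Module
```

If the markup changes (different attribute order, extra classes, nested divs), a pattern like this is likely to miss matches or capture too much, which is the usual argument for an HTML parser.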

edited Dec 23 '16 at 16:45

answered Dec 23 '16 at 16:29


The only issue is that I need to extract multiple of these from a larger page and there's other HTML tags. Is there any way I can extract only these tags? – Zach Z Dec 23 '16 at 16:31
If you mean find only divs, then yes. – SouXin Dec 23 '16 at 16:37
Not only divs, I need the divs with that specific class name only. How can I regex that out? – Zach Z Dec 23 '16 at 16:39
Which class do you need? All of them? – SouXin Dec 23 '16 at 16:41
All of the ones with this class: "gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long" – Zach Z Dec 23 '16 at 16:43

score 0 · Answer 2 · answered Dec 25 '16 at 18:52

If you use an HTML parser like HtmlAgilityPack (Getting Started With HTML Agility Pack), you can do something like this:

Option Infer On
Option Strict On

Imports HtmlAgilityPack

Module Module1

    Sub Main()
        ' some test data...
        Dim s = "<div class=""gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long"" dir=""ltr"" style=""word-break:break-all;"">pastebin.com/N8VKGxR9</div>"
        s &= "<div class=""gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long"" dir=""ltr"" style=""word-break:break-all;"">pastebin.com/ABC</div>"
        s &= "<div class=""WRONGCLASS gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long"" dir=""ltr"" style=""word-break:break-all;"">pastebin.com/N8VKGxR9</div>"

        Dim doc As New HtmlDocument
        doc.LoadHtml(s)

        ' match the classes string /exactly/ (the comparison is case-sensitive):
        Dim wantedNodes = doc.DocumentNode.SelectNodes("//div[@class='gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long']")

        ' An alternative if you want the divs with /at least/ those classes:
        'Dim wantedNodes = doc.DocumentNode.SelectNodes("//div[contains(@class, 'gs-bidi-start-align') and contains(@class, 'gs-visibleUrl') and contains(@class, 'gs-visibleUrl-long')]")

        ' show the resultant data:
        If wantedNodes IsNot Nothing Then
            For Each n In wantedNodes
                Console.WriteLine(n.InnerHtml)
            Next
        End If

        Console.ReadLine()

    End Sub

End Module

Outputs:

pastebin.com/N8VKGxR9
pastebin.com/ABC

HTML parsers have the advantage that they will generally tolerate malformed HTML - for example, the test data shown above is not a valid HTML document and yet the desired data is parsed from it successfully.
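
One caveat about the contains(@class, ...) alternative above: it is a substring test, so contains(@class, 'gs-visibleUrl') is also true for an element whose only class is gs-visibleUrl-long. A common XPath 1.0 idiom pads the class list with spaces so that only whole class names match. The following is a sketch under the assumption that HtmlAgilityPack evaluates the standard concat and normalize-space functions; the module name and test markup are invented for illustration:

```vbnet
Imports System
Imports HtmlAgilityPack

Module ClassTokenSketch

    Sub Main()
        ' Hypothetical test data: the first div only has the "-long" class,
        ' the second one really carries the gs-visibleUrl class.
        Dim doc As New HtmlDocument()
        doc.LoadHtml("<div class=""gs-visibleUrl-long"">a</div><div class=""x gs-visibleUrl y"">b</div>")

        ' Surround the normalized class list with spaces, then look for the
        ' class name with a space on each side, i.e. as a whole token.
        Dim xpath As String =
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' gs-visibleUrl ')]"

        Dim nodes = doc.DocumentNode.SelectNodes(xpath)
        If nodes IsNot Nothing Then
            For Each n As HtmlNode In nodes
                Console.WriteLine(n.InnerText)
            Next
        End If
    End Sub

End Module
```

With this test data only the second div should be selected, because the first one has no standalone gs-visibleUrl class.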
