The input string is:

<input type="hidden" name="locale" value="us">

The regex pattern is:

Dim r As New Regex("<input\s{0,}(?:(name|type|value)=""([^""]+)""\s{0,})+>")

The code being used:

        If r.IsMatch(s) Then
            For Each m As Match In r.Matches(s)
                Debug.Print(m.ToString)
                For i As Integer = 0 To m.Groups.Count - 1
                    Debug.Print(New String(" "c, i + 1) & "-" & m.Groups(i).Value)
                Next
            Next
        End If

The output:

<input type="hidden" name="locale" value="us">
 -<input type="hidden" name="locale" value="us">
  -value
   -us

I would expect it to match:

-type
-hidden
-name
-locale
-value
-us

The alternation is tried in the order its alternatives are listed; perhaps that is why the output shows only one name/value pair, the one from the last match.

Wiktor Stribiżew
Data
  • [Don't parse HTML with regex!](http://stackoverflow.com/a/1732454/418066) – Biffen Nov 09 '16 at 06:18
  • I've heard similar disputes before. Just because it's difficult, I refuse to believe there isn't a regex junkie that can tackle this. – Data Nov 09 '16 at 06:29
  • It's not a matter of being *difficult*: It's that HTML is so complex that a *proper* regex would be *huge*. – Biffen Nov 09 '16 at 06:30
  • I only want to match this string, not a whole HTML page. To help your point, I know this could easily be parsed out using .IndexOf and .Substring, etc. – Data Nov 09 '16 at 06:32
  • Yeah. And then it changes the quotes to `'`. Or adds some whitespace around the `=`. Or a value-less attribute shows up. Do you see where I'm going? – Biffen Nov 09 '16 at 06:33
  • Gotcha, point taken. A complex pattern it would be but not impossible. I still have faith. – Data Nov 09 '16 at 06:35
  • Your presumption is wrong, your regex matches all the groups. Not only does it match the groups, but also *captures*. – Wiktor Stribiżew Nov 09 '16 at 07:44

1 Answer

Parsing HTML with regex is not a good idea. Use HtmlAgilityPack or a similar library that is designed for the job. See the question "How do you parse an HTML in vb.net".
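As a rough illustration of the parser route, here is a minimal sketch that assumes the HtmlAgilityPack NuGet package is installed and referenced. The class name `ParserSketch` is made up for this example, and the code only lists the attributes of the first `input` element it finds:

```vb
Imports System
Imports HtmlAgilityPack   ' third-party package; assumed to be installed via NuGet

Public Class ParserSketch   ' hypothetical name, for illustration only
    Public Shared Sub Main()
        Dim s As String = "<input type=""hidden"" name=""locale"" value=""us"">"

        ' Load the markup into an HtmlAgilityPack document
        Dim doc As New HtmlDocument()
        doc.LoadHtml(s)

        ' XPath lookup of the first <input> element; Nothing if there is none
        Dim node As HtmlNode = doc.DocumentNode.SelectSingleNode("//input")
        If node IsNot Nothing Then
            ' Print every attribute name followed by its value
            For Each attr As HtmlAttribute In node.Attributes
                Console.WriteLine(" -" & attr.Name)
                Console.WriteLine(" -" & attr.Value)
            Next
        End If
    End Sub
End Class
```

Because the parser deals with quoting and whitespace itself, the variations mentioned in the comments (single quotes, spaces around `=`) should not require any change to this code.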

To answer your question: your loop reads only `Group.Value`, which for a repeated group holds just the last capture. Every capture is stored in that group's `Captures` collection, which your code never accesses. Here is a simple snippet showing how to obtain your desired result using the same regex:

Imports System
Imports System.Text.RegularExpressions

Public Class Test
    Public Shared Sub Main()
        Dim r As New Regex("<input\s{0,}(?:(name|type|value)=""([^""]+)""\s{0,})+>")
        Dim s As String
        s = "<input type=""hidden"" name=""locale"" value=""us"">"
        If r.IsMatch(s) Then
            For Each m As Match In r.Matches(s)
                Console.WriteLine(m.ToString)
                For j As Integer = 0 To m.Groups(1).Captures.Count - 1      ' Number of captures in Capture stack 1 (same will be in the second one)
                    Console.WriteLine(" -" & m.Groups(1).Captures(j).Value) ' Print the 1st group captures
                    Console.WriteLine(" -" & m.Groups(2).Captures(j).Value) ' Print the 2nd group captures
                Next
            Next
        End If
    End Sub
End Class

Output:

<input type="hidden" name="locale" value="us">
 -type
 -hidden
 -name
 -locale
 -value
 -us

See the VB.NET demo
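To make the difference between `Group.Value` and `Group.Captures` explicit, here is a small sketch using the same regex and input. The class name `CaptureSketch` and the dictionary `attrs` are made up for this example; pairing the two capture collections by index relies on both groups being captured once per repetition, which holds for this pattern:

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Public Class CaptureSketch   ' hypothetical name, for illustration only
    Public Shared Sub Main()
        Dim r As New Regex("<input\s{0,}(?:(name|type|value)=""([^""]+)""\s{0,})+>")
        Dim m As Match = r.Match("<input type=""hidden"" name=""locale"" value=""us"">")

        ' Value reflects only the last repetition of the group
        Console.WriteLine(m.Groups(1).Value)            ' value
        ' Captures keeps every repetition
        Console.WriteLine(m.Groups(1).Captures.Count)   ' 3

        ' Pair the j-th attribute name with the j-th attribute value
        Dim attrs As New Dictionary(Of String, String)
        For j As Integer = 0 To m.Groups(1).Captures.Count - 1
            attrs(m.Groups(1).Captures(j).Value) = m.Groups(2).Captures(j).Value
        Next
        Console.WriteLine(attrs("name"))                ' locale
    End Sub
End Class
```

Collecting the pairs into a dictionary is only one option; it makes lookups by attribute name convenient, but it would silently keep the last value if an attribute were repeated.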

Wiktor Stribiżew