
I have serious problems with regex. I need to get all the text between two strings; in this case the strings are <span class="user user-role-registered-member"> and </span>.

I have googled quite a few questions (some of them on Stack Overflow) and watched YouTube tutorials, but I still can't get it.

This is the code that I thought would work, but I don't know why it doesn't.

Dim mystring As String = "<br>Terms of Service<br></br>Developers<br>"

Dim pattern1 As String = "(?<=<br>)(.*?)(?=<br>)"
Dim pattern2 As String = "(?<=</br>)(.*)(?=<br>)"

Dim m1 As MatchCollection = Regex.Matches(mystring, pattern1)
Dim m2 As MatchCollection = Regex.Matches(mystring, pattern2)
MsgBox(m1(0).ToString)
MsgBox(m2(0).ToString)

OK, so this code works pretty well with <br>. I tried replacing the <br> in pattern1 and pattern2 with the span tags, but then it doesn't work. I know that I am making a mistake here, but I don't know where or how.

Any answer will be really helpful.

Stefan Đorđević

5 Answers


You can also do it with XML:

Dim s As String = "<span class=""user user-role-registered-member"">Keyboard</span>"
Dim doc As New System.Xml.XmlDocument
doc.LoadXml(s)
Console.WriteLine(doc.FirstChild.InnerText) ' Outputs: "Keyboard"

There are reasons given for not trying to parse HTML with regexes in the answers to "RegEx match open tags except XHTML self-contained tags".
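One caveat with the XML route is that the input has to be well-formed XML; `XmlDocument.LoadXml` throws an `XmlException` otherwise. The following is a minimal sketch of guarding against that, assuming a console application. The helper name `TryGetInnerText` and the sample inputs are hypothetical, not part of the answer above.

```vb
Imports System
Imports System.Xml

Module XmlGuardDemo
    ' Hypothetical helper: returns True and the element's text when the
    ' fragment is well-formed XML, False when the parser rejects it.
    Function TryGetInnerText(fragment As String, ByRef text As String) As Boolean
        Try
            Dim doc As New XmlDocument()
            doc.LoadXml(fragment)
            text = doc.DocumentElement.InnerText
            Return True
        Catch ex As XmlException
            text = Nothing
            Return False
        End Try
    End Function

    Sub Main()
        Dim text As String = Nothing

        ' Well-formed: one element with matching open and close tags.
        Console.WriteLine(TryGetInnerText("<span class=""x"">Keyboard</span>", text)) ' True
        Console.WriteLine(text)                                                       ' Keyboard

        ' Not well-formed: the <br> tags are never closed.
        Console.WriteLine(TryGetInnerText("<br>Terms of Service<br>", text))          ' False
    End Sub
End Module
```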

Andrew Morton

Thank you very much for the answers. I found the answer myself (EvilTak's answer gave me the idea).

Dim findtext As String = "(?<=<span class=""user user-role-registered-member"">)(.*?)(?=</span>)"
Dim myregex As String = "<span class=""user user-role-registered-member"">Keyboard</span>"
Dim doregex As MatchCollection = Regex.Matches(myregex, findtext)
MsgBox(doregex(0).ToString)
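As a minimal sketch, the same lookaround pattern can be applied to input that contains more than one span by looping over the `MatchCollection`. This assumes a console application; the two-span input and the names `html` and `pattern` are illustrative only. Note that inside a VB string literal each `"` is written as `""`, which is easy to miss when swapping `<br>` for the span tag.

```vb
Imports System
Imports System.Text.RegularExpressions

Module SpanTextDemo
    Sub Main()
        ' Hypothetical input with two matching spans.
        Dim html As String =
            "<span class=""user user-role-registered-member"">Keyboard</span>" &
            "<span class=""user user-role-registered-member"">Mouse</span>"

        ' Lookbehind for the opening tag, lazy capture, lookahead for the closing tag.
        Dim pattern As String =
            "(?<=<span class=""user user-role-registered-member"">)(.*?)(?=</span>)"

        ' Prints "Keyboard" and then "Mouse", one per line.
        For Each m As Match In Regex.Matches(html, pattern)
            Console.WriteLine(m.Value)
        Next
    End Sub
End Module
```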

StackOverFlow is so powerful...♥

Stefan Đorđević

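Use explicit (named) capture groups. The following should do the job:

' YourInputString is the text to search. With ExplicitCapture only named groups such as GRP are captured.
' .*? is lazy, so the match stops at the first </span> instead of the last one.
Dim exp = "<span class=""user user-role-registered-member"">(?<GRP>.*?)</span>"
Dim M = System.Text.RegularExpressions.Regex.Match(YourInputString, exp, System.Text.RegularExpressions.RegexOptions.ExplicitCapture)
If M.Groups("GRP").Value <> "" Then
  Return M.Groups("GRP").Value
End If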
dotNET

This does the job easily and beautifully. It won't return a match when there is no text inside the span, so you do not need to worry about empty matches. It will however return groups with only whitespace in them.

<span class=""user user-role-registered-member"">(.+)</span>

Test it out here.

EvilTak
  • @Stefan Đorđević beware, this will fail if there are multiple closing span tags..make it non greedy using `.+?` – rock321987 Jun 04 '16 at 14:00
  • Easy for you, not readable for others. `Xml` approach is most clear, easy and readable, sorry – Fabio Jun 04 '16 at 17:52
  • @Fabio all I did was answer the question. The question asked me for a Regex, and I gave one. I'd have gone for the XML parser route too, but I knew that *someone* would mention it and decided to answer the question instead. – EvilTak Jun 05 '16 at 05:47
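The greedy versus non-greedy point raised in the comments can be seen in a minimal sketch like the one below. It assumes a console application, and the shortened `class=""a""` input is made up for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module GreedyVersusLazy
    Sub Main()
        ' Hypothetical input: two spans on the same line.
        Dim html As String = "<span class=""a"">one</span> and <span class=""a"">two</span>"

        ' Greedy: .+ runs on to the last </span> on the line.
        ' Prints: one</span> and <span class="a">two
        Console.WriteLine(Regex.Match(html, "<span class=""a"">(.+)</span>").Groups(1).Value)

        ' Lazy: .+? stops at the first </span>.
        ' Prints: one
        Console.WriteLine(Regex.Match(html, "<span class=""a"">(.+?)</span>").Groups(1).Value)
    End Sub
End Module
```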

Your text is XML, so why hack at strings with regex when you can do it in a clear and readable way?
With LINQ to XML:

Dim htmlPage As XDocument = XDocument.Parse(downloadedHtmlPage)

Dim className As String = "user user-role-registered-member"
Dim value As String = 
    htmlPage.Descendants("span").
    Where(Function(span) span.Attribute("class")?.Value = className).
    FirstOrDefault()?.Value ' ?. (VB 14 or later) yields Nothing instead of throwing when the attribute or span is missing

And with XML axis properties (see "Accessing XML in Visual Basic"):

Dim htmlPage As XDocument = XDocument.Parse(downloadedHtmlPage)

Dim className As String = "user user-role-registered-member"
Dim value As String = 
    htmlPage...<span>.
    Where(Function(span) span.@class = className).
    FirstOrDefault()?.Value ' span.@class is already a String; ?. (VB 14 or later) yields Nothing when no span matches
Fabio