
I am trying to grab the word Juwelier that appears just before the closing </a> tag in this HTML page code.

I am not very good with RegEx, especially when matching across multiple lines. These things will NOT be dynamic:

  • <p>Rubriek:
  • class="category"
  • and of course the HTML tags like <p>, </p>, <a>, </a>

This is the HTML page code

    <p>Rubriek: 

      <a href="http://www.detelefoongids.nl/juwelier/4-1/?oWhat=Juwelier"
         title="Juwelier"
         class="category">
           Juwelier
      </a>
   </p>
Rohit Jain
user2001411
  • Possible duplication of http://stackoverflow.com/questions/1935918/php-substring-extraction-get-the-string-before-the-first-or-the-whole-strin?rq=1 – samayo Jan 22 '13 at 18:52
  • @PHPNooB - I don't think so. This is a VB.NET application, so the regex and the code will be different. – JDB Jan 22 '13 at 19:00
  • Please consider using the [HTML Agility Pack](http://htmlagilitypack.codeplex.com/) for tasks like this. Regexes are ill-suited to it (harder to implement and much harder to maintain). – JDB Jan 22 '13 at 19:02
  • possible duplicate of [Get "Title" attribute from html link using Regex](http://stackoverflow.com/questions/853388/get-title-attribute-from-html-link-using-regex) – JDB Jan 22 '13 at 19:19

1 Answer


The Regex below is one among many that you could use.
It uses zero-width positive look-behind (?<=) and look-ahead (?=) assertions to locate the target string.

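    ' Required at the top of the file:
    Imports System.Text.RegularExpressions

    ' Sample input: the HTML fragment from the question.
    Dim str As String =
        "<p>Rubriek:" & vbCrLf &
        "  <a href=""http://www.detelefoongids.nl/juwelier/4-1/?oWhat=Juwelier""" & vbCrLf &
        "     title=""Juwelier""" & vbCrLf &
        "     class=""category"">" & vbCrLf &
        "       Juwelier" & vbCrLf &
        "  </a>" & vbCrLf &
        "</p>"

    ' The look-behind anchors on the fixed prefix up to class="category">;
    ' the look-ahead requires the closing </a> after the word.
    Dim match As Match = Regex.Match(str,
        "(?<=<p>Rubriek:[^>]+?class=""category"">\W*)\w+(?=\W*</a>)")

    If match.Success Then
        MsgBox(match.Value) ' Shows "Juwelier"
    End If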

Although it is not used above, there is one important thing to remember when matching across multiple lines: if you rely on the wildcard metacharacter ., enable Single-line mode so that . matches every character, including newlines. You can do this by passing RegexOptions.Singleline or by putting (?s) at the start of the pattern.
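As a rough illustration of Single-line mode, here is a minimal sketch that swaps the look-around pattern for a capture group and uses . to span the multi-line tag. The module name SinglelineSketch and the helper ExtractCategory are made up for this example, and it assumes the markup keeps the fixed parts listed in the question.

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module SinglelineSketch
    ' Hypothetical helper: returns the link text of the category anchor,
    ' or Nothing when the pattern does not match.
    Function ExtractCategory(html As String) As String
        ' Singleline lets "." match line breaks, so ".*?" can cross the
        ' multi-line <a ...> tag. Group 1 holds the link text without the
        ' surrounding whitespace.
        Dim m As Match = Regex.Match(html,
            "<p>Rubriek:.*?class=""category"">\s*(.*?)\s*</a>",
            RegexOptions.Singleline)
        Return If(m.Success, m.Groups(1).Value, Nothing)
    End Function

    Sub Main()
        Dim html As String =
            "<p>Rubriek:" & vbCrLf &
            "  <a href=""http://www.detelefoongids.nl/juwelier/4-1/?oWhat=Juwelier""" & vbCrLf &
            "     title=""Juwelier""" & vbCrLf &
            "     class=""category"">" & vbCrLf &
            "       Juwelier" & vbCrLf &
            "  </a>" & vbCrLf &
            "</p>"
        Console.WriteLine(ExtractCategory(html))
    End Sub
End Module
```

Without RegexOptions.Singleline, the first .*? could not get past the line break after Rubriek: and the match would fail. Unlike \w+, the capture group would also return link text that consists of several words.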

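\w+ matches one or more word characters: letters, digits, and the underscore. In .NET this includes Unicode letters and digits; it is limited to a-zA-Z0-9_ only when RegexOptions.ECMAScript is set.
\W* matches zero or more non-word characters, which here covers the whitespace and line breaks around the link text.
[^>] matches any character other than >, including line breaks.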
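To see how these classes behave in .NET, the short sketch below may help. The sample strings and the module name CharacterClassSketch are invented for illustration; the only API used is Regex.Match / Regex.Matches with the standard RegexOptions.ECMAScript flag.

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module CharacterClassSketch
    Sub Main()
        ' \w+ matches each run of word characters as a separate match,
        ' so the "&" and the spaces split this input into two words.
        For Each m As Match In Regex.Matches("Juwelier & Horloger", "\w+")
            Console.WriteLine(m.Value)
        Next

        ' By default \w is Unicode-aware in .NET, so the accented letter
        ' is part of the word.
        Console.WriteLine(Regex.Match("Café", "\w+").Value)

        ' With RegexOptions.ECMAScript, \w is restricted to [a-zA-Z0-9_],
        ' so the match stops before the accented letter.
        Console.WriteLine(Regex.Match("Café", "\w+", RegexOptions.ECMAScript).Value)

        ' [^>]+ also matches line breaks, which is why it can span a tag
        ' whose attributes are spread over several lines.
        Dim tag As String = "<a title=""Juwelier""" & vbCrLf & "   class=""category"">"
        Console.WriteLine(Regex.Match(tag, "[^>]+").Value.Contains(vbCrLf))
    End Sub
End Module
```

Because \w+ stops at the first non-word character, the pattern in this answer only suits categories that are a single word.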

MikeM