I'm having a bit of a brain freeze here, so I was hoping for some pointers. Essentially, I need to extract the contents of a specific div tag. Yes, I know that regex usually isn't approved of for this, but it's a simple web scraping application where there are no nested divs.

I'm trying to match this:

    <div class="entry">
      <span class="title">Some company</span>
      <span class="description">
      <strong>Address: </strong>Some address
        <br /><strong>Telephone: </strong> 01908 12345
      </span>
    </div>

The simple VB code is as follows:

    Dim myMatches As MatchCollection
    Dim myRegex As New Regex("<div.*?class=""entry"".*?>.*</div>", RegexOptions.Singleline)
    Dim wc As New WebClient
    Dim html As String = wc.DownloadString("http://somewebaddress.com")
    RichTextBox1.Text = html
    myMatches = myRegex.Matches(html)
    MsgBox(html)
    ' Show each match found in the page
    Dim successfulMatch As Match
    For Each successfulMatch In myMatches
        MsgBox(successfulMatch.Groups(1).ToString)
    Next

Any help would be greatly appreciated.

rink.attendant.6
Mrk Fldig
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Richard Jul 03 '12 at 07:50
  • And what's wrong with the regex you're having? It matches your input. – Tim Pietzcker Jul 03 '12 at 07:54
  • Well, that's the odd bit: it's not matching anything on the entire page, and there are about 20 of those divs on there – Mrk Fldig Jul 03 '12 at 07:56
  • I know that @Tim has answered this in a much better way than I could, but for your future reference, the pattern has no capturing group, so `Groups(1)` (the index is 0-based, and `Groups(0)` is the whole match) will always return an empty string... it should be `Groups(0)` – freefaller Jul 03 '12 at 08:09

4 Answers

Your regex works for your example. There are some improvements that should be made, though:

    <div[^<>]*class="entry"[^<>]*>(?<content>.*?)</div>

`[^<>]*` means "match any number of characters except angle brackets", ensuring that we don't accidentally break out of the tag we're in.

`.*?` (note the `?`) means "match any number of characters, but only as few as possible". This avoids matching everything from the first `<div class="entry">` tag to the last `</div>` in your page.
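To make the greedy versus lazy difference concrete, here is a minimal sketch. The two-div input string and the module name are made up for illustration; only the patterns come from the discussion above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LazyVersusGreedy
    Sub Main()
        ' Hypothetical input: two entry divs on one page.
        Dim html As String =
            "<div class=""entry"">first</div>" & vbLf &
            "<div class=""entry"">second</div>"

        ' Greedy .* runs on to the last </div>, so both divs end up in one match.
        Dim greedyRegex As New Regex("<div[^<>]*class=""entry""[^<>]*>.*</div>", RegexOptions.Singleline)
        ' Lazy .*? stops at the first </div> it can, giving one match per div.
        Dim lazyRegex As New Regex("<div[^<>]*class=""entry""[^<>]*>.*?</div>", RegexOptions.Singleline)

        Console.WriteLine(greedyRegex.Matches(html).Count) ' prints 1
        Console.WriteLine(lazyRegex.Matches(html).Count)   ' prints 2
    End Sub
End Module
```

With about 20 such divs on a page, the greedy form would hand back one huge match instead of 20 separate ones.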

But your regex itself should still have matched something. Perhaps you're not using it correctly?

I don't know Visual Basic, so this is just a shot in the dark, but RegexBuddy suggests the following approach:

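    ' SubjectString holds the downloaded HTML.
    ' Singleline lets "." match line breaks, which the div's content contains.
    Dim ResultList As New List(Of String)
    Dim RegexObj As New Regex("<div[^<>]*class=""entry""[^<>]*>(?<content>.*?)</div>", RegexOptions.Singleline)
    Dim MatchResult As Match = RegexObj.Match(SubjectString)
    While MatchResult.Success
        ' Collect the text captured by the named group "content".
        ResultList.Add(MatchResult.Groups("content").Value)
        MatchResult = MatchResult.NextMatch()
    End While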

I would recommend against taking the regex approach any further than this. If you insist, you'll end up with a monster regex like the following, which will only work if the form of the div's contents never varies:

    <div[^<>]*class="entry"[^<>]*>\s*
    <span[^<>]*class="title"[^<>]*>\s*
    (?<title>.*?)
    \s*</span>\s*
    <span[^<>]*class="description"[^<>]*>\s*
    <strong>\s*Address:\s*</strong>\s*
    (?<address>.*?)
    \s*<strong>\s*Telephone:\s*</strong>\s*
    (?<phone>.*?)
    \s*</span>\s*</div>

or (behold the joy of multiline strings in VB.NET):

    Dim RegexObj As New Regex(
        "<div[^<>]*class=""entry""[^<>]*>\s*" & chr(10) & _
        "<span[^<>]*class=""title""[^<>]*>\s*" & chr(10) & _
        "(?<title>.*?)" & chr(10) & _
        "\s*</span>\s*" & chr(10) & _
        "<span[^<>]*class=""description""[^<>]*>\s*" & chr(10) & _
        "<strong>\s*Address:\s*</strong>\s*" & chr(10) & _
        "(?<address>.*?)" & chr(10) & _
        "\s*<strong>\s*Telephone:\s*</strong>\s*" & chr(10) & _
        "(?<phone>.*?)" & chr(10) & _
        "\s*</span>\s*</div>",
        RegexOptions.Singleline Or RegexOptions.IgnorePatternWhitespace)

(Of course, now you need to store the results for MatchResult.Groups("title") etc...)
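As a rough sketch of that last step, the snippet below runs the same pattern over the sample markup from the question and reads the three named groups. The module and variable names are made up for illustration, and the pattern is joined into a single string without line breaks, so `IgnorePatternWhitespace` is not needed here. Note one assumption-specific detail: on this sample the lazy `address` group runs up to the next `<strong>`, so it still carries the trailing `<br />`, which the sketch strips by hand.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NamedGroupsDemo
    Sub Main()
        ' Sample markup copied from the question.
        Dim html As String =
            "<div class=""entry"">" & vbLf &
            "  <span class=""title"">Some company</span>" & vbLf &
            "  <span class=""description"">" & vbLf &
            "  <strong>Address: </strong>Some address" & vbLf &
            "    <br /><strong>Telephone: </strong> 01908 12345" & vbLf &
            "  </span>" & vbLf &
            "</div>"

        ' The same pattern as above, concatenated into one line.
        Dim pattern As String =
            "<div[^<>]*class=""entry""[^<>]*>\s*" &
            "<span[^<>]*class=""title""[^<>]*>\s*" &
            "(?<title>.*?)" &
            "\s*</span>\s*" &
            "<span[^<>]*class=""description""[^<>]*>\s*" &
            "<strong>\s*Address:\s*</strong>\s*" &
            "(?<address>.*?)" &
            "\s*<strong>\s*Telephone:\s*</strong>\s*" &
            "(?<phone>.*?)" &
            "\s*</span>\s*</div>"
        Dim entryRegex As New Regex(pattern, RegexOptions.Singleline)

        Dim m As Match = entryRegex.Match(html)
        While m.Success
            Dim title As String = m.Groups("title").Value
            ' Drop the <br /> that the address capture picks up in this sample.
            Dim address As String = m.Groups("address").Value.Replace("<br />", "").Trim()
            Dim phone As String = m.Groups("phone").Value
            Console.WriteLine(title & " | " & address & " | " & phone)
            m = m.NextMatch()
        End While
        ' prints: Some company | Some address | 01908 12345
    End Sub
End Module
```

The manual `<br />` cleanup is a small illustration of how brittle this approach is: any change in the markup between the fields ends up inside a capture.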

Tim Pietzcker
  • You, my friend, are a star! If I wanted to get each element inside that div, i.e. the span class values, I'd just do `.*?<span[^<>]*class="title"` after the closing > of the div tag? – Mrk Fldig Jul 03 '12 at 08:07
  • I believe the original code is not picking anything up because it should be `Groups(0)` instead of `Groups(1)` – freefaller Jul 03 '12 at 08:10
  • @MarcFielding: I have edited my answer: The named capturing group `(?<content>.*?)` will capture everything between the `div`s. – Tim Pietzcker Jul 03 '12 at 08:12
  • @freefaller Yeah, I noticed that one. I was actually using a breakpoint and examining the match collection to see if it was picking up anything – Mrk Fldig Jul 03 '12 at 08:16
  • I'm going to mark Tim's answer as the correct one, although I wouldn't mind knowing how I extract the values of each span so I can pull the company name, address and phone number, if you're feeling energetic, Tim? – Mrk Fldig Jul 03 '12 at 08:31
  • @MarcFielding: This is only (reasonably) possible if the spans are always in the same order, and it's going to be messy in any case. Regular expressions are really the wrong tool for this. For example, how can you tell when an address is over? I'll post a (brittle) example regex that will work on your example, but that will likely fail on anything that looks a bit different. – Tim Pietzcker Jul 03 '12 at 08:42
  • Thanks Tim, the components within that div are always in the same order. I could always split them by ">" into an array and do some substringing, but I was wondering if there was an easier way. – Mrk Fldig Jul 03 '12 at 08:48
  • Tim, that's superb and saved me loads of time. If I could do anything more than tick the box, I would. Superb! – Mrk Fldig Jul 03 '12 at 08:58
  • @MarcFielding: Great to hear it. A suggestion: [RegexBuddy](http://www.regexbuddy.com) is a great tool for constructing, debugging and learning regexes. Will pay for its price within days in terms of increased productivity. – Tim Pietzcker Jul 03 '12 at 09:03

Try using RegexOptions.Multiline instead of RegexOptions.Singleline

Thanks to @Tim for pointing out that the above doesn't work... my bad.
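For reference, a minimal sketch of why swapping the option does not help: `RegexOptions.Multiline` only changes where `^` and `$` match, while `RegexOptions.Singleline` is what lets `.` match line breaks. The short input string and module name are made up for illustration; the pattern is the one from the question.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SinglelineVersusMultiline
    Sub Main()
        ' Hypothetical multi-line div, standing in for the downloaded page.
        Dim html As String = "<div class=""entry"">" & vbLf & "Some company" & vbLf & "</div>"
        Dim pattern As String = "<div.*?class=""entry"".*?>.*</div>"

        ' Singleline: "." also matches line breaks, so the match can span lines.
        Console.WriteLine(Regex.IsMatch(html, pattern, RegexOptions.Singleline)) ' prints True
        ' Multiline only affects ^ and $; "." still stops at each line break.
        Console.WriteLine(Regex.IsMatch(html, pattern, RegexOptions.Multiline))  ' prints False
    End Sub
End Module
```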

@Tim's answer is a good one, and should be the accepted answer, but an extra part that is stopping your code from working is that the pattern has no capturing group for `Groups(1)` to return; `Groups(0)` is the whole match.

Change...

    MsgBox(successfulMatch.Groups(1).ToString)

To...

    MsgBox(successfulMatch.Groups(0).ToString)

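A small sketch of what the two indexes return for a pattern without parentheses. The one-line input and the module name are made up for illustration; the pattern is the one from the question.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupIndexDemo
    Sub Main()
        Dim html As String = "<div class=""entry"">Some company</div>"
        ' The question's pattern: no parentheses, so no capturing groups.
        Dim m As Match = Regex.Match(html, "<div.*?class=""entry"".*?>.*</div>", RegexOptions.Singleline)

        Console.WriteLine(m.Groups.Count)         ' prints 1: only the whole match
        Console.WriteLine(m.Groups(0).Value)      ' prints the full <div>...</div> text
        Console.WriteLine(m.Groups(1).Success)    ' prints False
        Console.WriteLine(m.Groups(1).Value = "") ' prints True: an empty string
    End Sub
End Module
```

So the loop in the question was finding matches all along and then displaying an empty group for each of them.
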
freefaller

Use this one:

    <div.*?class=""entry"".*?>(?<divBody>.*)</div>

and read the group named `divBody`.

But be careful: this does not work if the string contains another div node (and there seems to be no way to resolve this with regex). For your solution, XSLT may be useful.

Ria
  • Careful, this matches *all* div tags (not just those with `class="entry"`), and it matches everything from the very first opening `<div>` to the very last closing `</div>`. – Tim Pietzcker Jul 03 '12 at 08:02
  • Used `(?<divBody>.*)` - not working. As Tim said, it should match everything, but apparently it doesn't – Mrk Fldig Jul 03 '12 at 08:05

Really good article. Please see the attached results from Eclipse below. (image: results from Eclipse)