I have the HTML of a Bing results page and I want to parse the results from it with:

    string BingRegex = "<div class=\"sb_tlst\"><h3><a href=\"(.*?)\"";
    string[] results = Regex.Matches(responseStr, BingRegex).Cast<Match>().Select(m => m.Value).ToArray();

I get the results in the array, but each result includes the whole matched pattern, something like:

<div class=\"sb_tlst\"><h3><a href=\"www.cnn.com\"
<div class=\"sb_tlst\"><h3><a href=\"www.google.com\"
<div class=\"sb_tlst\"><h3><a href=\"www.gmail.com\"

Any idea how I can fix this and get only the URL?

John Saunders
YosiFZ
  • 4
    You shouldn't use regex to parse html. – gleng Dec 18 '13 at 15:16
  • 4
    See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – John Saunders Dec 18 '13 at 15:16
  • 1
    well you can, but it will go wrong on you quickly. – Tony Hopkinson Dec 18 '13 at 15:17
  • 4
    Questions like this come up every so often and the response is often the same... don't! Try using something like the HTML Agility pack http://htmlagilitypack.codeplex.com/ – Liath Dec 18 '13 at 15:17
  • Does this answer your question? [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Cole Tobin Oct 13 '22 at 14:15

2 Answers

5

I would suggest not using regex to parse HTML. Use HtmlAgilityPack as suggested here, then use XPath to get the value of the attribute you need.

The XPath for your sample div

<div class="sb_tlst">
    <h3>
        <a href="www.gmail.com"/>
    </h3>
</div>

would be

/div[@class='sb_tlst']/h3/a/@href
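
As a rough illustration of that approach, here is a minimal sketch using HtmlAgilityPack (a third-party NuGet package, not part of .NET). The class and method names (`BingLinksHap`, `ExtractUrls`) are made up for this example, and the `sb_tlst` markup is assumed to match the sample above; real Bing pages may differ. It uses `//div` rather than `/div` because in a full page the div is not the root element, and it selects the `a` elements and then reads `href` from each one.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // third-party package: install HtmlAgilityPack from NuGet

public static class BingLinksHap // hypothetical name, for illustration only
{
    // Returns the href of every <a> found under <div class="sb_tlst"><h3>.
    public static List<string> ExtractUrls(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//" searches the whole document instead of only the root element.
        var anchors = doc.DocumentNode.SelectNodes("//div[@class='sb_tlst']/h3/a");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (anchors == null)
        {
            return new List<string>();
        }

        return anchors
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => href.Length > 0) // drop anchors without an href
            .ToList();
    }
}
```

Calling `BingLinksHap.ExtractUrls(responseStr)` on the sample div should give a list containing only `www.gmail.com`.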
carla
Pavel K
2

Aside from doing this with an HTML parser (which is a better idea), replace:

Select(m => m.Value)

with:

Select(m => m.Groups[1].Value)

Although you'll probably want to throw in a little error handling to check that the group is actually populated.
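
Putting that together with the check on the group, a minimal sketch might look like the following. It reuses the pattern from the question; the class and method names (`BingLinks`, `ExtractUrls`) are invented for this example, and skipping empty captures is just one possible way to handle the error case.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class BingLinks // hypothetical name, for illustration only
{
    // Pattern from the question; group 1 captures the href value.
    private const string BingRegex = "<div class=\"sb_tlst\"><h3><a href=\"(.*?)\"";

    public static string[] ExtractUrls(string responseStr)
    {
        return Regex.Matches(responseStr, BingRegex)
            .Cast<Match>()
            // Keep only matches where group 1 actually captured something.
            .Where(m => m.Groups[1].Success && m.Groups[1].Value.Length > 0)
            // Groups[0] is the whole match; Groups[1] is the first parenthesised group.
            .Select(m => m.Groups[1].Value)
            .ToArray();
    }
}
```

With this, `BingLinks.ExtractUrls(responseStr)` returns only the captured URLs, without the surrounding markup.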

But the best solution is not to use Regex or an HTML parser, but instead use the Bing search API because this is exactly what it's for.

Matt Burland