parse HTML with Regex

Question

I have the HTML of a Bing results page and I want to parse the results out of it with:

    string BingRegex = "<div class=\"sb_tlst\"><h3><a href=\"(.*?)\"";
    string[] results = Regex.Matches(responseStr, BingRegex).Cast<Match>().Select(m => m.Value).ToArray();

I get the results into the array, but each result includes the matched pattern text, something like:

    <div class=\"sb_tlst\"><h3><a href=\"www.cnn.com\"
    <div class=\"sb_tlst\"><h3><a href=\"www.google.com\"
    <div class=\"sb_tlst\"><h3><a href=\"www.gmail.com\"

Any idea how I can fix this and get only the URL?

See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — John Saunders, Dec 18 '13 at 15:16
Questions like this come up every so often and the response is often the same... don't! Try using something like the HTML Agility pack http://htmlagilitypack.codeplex.com/ — Liath, Dec 18 '13 at 15:17
Does this answer your question? [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Cole Tobin, Oct 13 '22 at 14:15

score 5 · Answer 1 · edited Dec 07 '17 at 07:34


I would suggest not using regex to parse HTML. Use HtmlAgilityPack as suggested here, then use XPath to get the value of the attribute you need.

The XPath for your sample div

    <div class="sb_tlst">
        <h3>
            <a href="www.gmail.com"/>
        </h3>
    </div>

would be

    /div[@class='sb_tlst']/h3/a/@href
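
As a rough illustration of how that XPath might be applied with HtmlAgilityPack, here is a minimal sketch. The class name `BingLinkParser` and the method `ExtractUrls` are hypothetical names chosen for this example, and the `sb_tlst` markup is assumed to match the sample above. The sketch uses `//div` so the div is found at any depth in a full page, and it selects the anchor elements and reads `href` from each one rather than selecting the attribute directly.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

public static class BingLinkParser
{
    // Hypothetical helper: returns the href of every result link in the given HTML.
    public static string[] ExtractUrls(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//" matches the div anywhere in the document, not only at the root.
        var anchors = doc.DocumentNode.SelectNodes("//div[@class='sb_tlst']/h3/a[@href]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (anchors == null)
        {
            return new string[0];
        }

        return anchors
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .ToArray();
    }
}
```

Calling `BingLinkParser.ExtractUrls(responseStr)` would then give an array of URLs only, with no surrounding markup. Note that `@class='sb_tlst'` is an exact string comparison, so a div carrying additional classes would not match.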

answered Dec 18 '13 at 15:17 by Pavel K, edited Dec 07 '17 at 07:34 by carla

score 2 · Matt Burland · Accepted Answer · 2013-12-18T16:38:16.743

Aside from doing this with an HTML parser (which is a better idea), replace:

    Select(m => m.Value)

with:

    Select(m => m.Groups[1].Value)

`m.Value` is the whole matched text, pattern included; `Groups[1]` is the first capture group, which here is the URL. You'll probably want to add a little error handling to check that the group is actually populated.
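
For reference, a minimal sketch of the question's code with the capture group and the check applied might look like this. `Groups` is a property of `Match` itself, so the group is read as `m.Groups[1]`. The class name `BingLinkExtractor` and the method `ExtractUrls` are hypothetical; the pattern is the one from the question. With this particular pattern the group takes part in every successful match, so the `Success` filter is only a safeguard in case the pattern later gains optional parts.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BingLinkExtractor
{
    // Pattern from the question; group 1 captures the href value.
    private const string BingRegex = "<div class=\"sb_tlst\"><h3><a href=\"(.*?)\"";

    // Hypothetical helper: returns only the captured URLs, not the full match text.
    public static string[] ExtractUrls(string responseStr)
    {
        return Regex.Matches(responseStr, BingRegex)
            .Cast<Match>()
            .Where(m => m.Groups[1].Success)   // skip matches where group 1 did not participate
            .Select(m => m.Groups[1].Value)    // group 1, not m.Value (the whole match)
            .ToArray();
    }
}
```

`BingLinkExtractor.ExtractUrls(responseStr)` can then replace the original one-liner.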

But the best solution is not to use Regex or an HTML parser, but instead use the Bing search API because this is exactly what it's for.

answered Dec 18 '13 at 15:22 by Matt Burland, edited Dec 18 '13 at 16:38
