0

Data:

<tr>
<td>
<a href="somelink">
some. .data...
</a>
</td>
<td>Black</td>
<td>57234</td>
<td>5431.60</td>
<td><font class="down">  -125.02</font></td>
</tr><tr>
<td>
<a href="somelink">
some. .data...
</a>
</td>
<td>Blue</td>
<td>57234</td>
<td>5431.60</td>
<td><font class="up">  -125.02</font></td>
</tr><tr>
<td>
<a href="somelink">
some. .data...
</a>
</td>
<td>Brown</td>
<td>57234</td>
<td>5431.60</td>
<td><font class="down">  -125.02</font></td>
</tr>
...more data...

I want to extract 'some. .data...', 'Black', '57234' and '5431.60' from each row in one pass. (The fifth td's data is not required.)

Initially,

<tr><td><a.*>([a-zA-Z0-9 -]+)</a></td><td>(\w+)</td><td>([\d]+\.\d+)</td><td>(\d+\.\d+)</td>

was working (I arrived at it by trial and error).

But now it's broken.

Now, when I use <td>(.*)</td> or <\w+>(.*)</\w+>, it shows the data from the last four tds in every tr. Why won't it show <a href...>...</a>, and how can I get the data I want?

2 Answers

6

Regex is, in general, a bad way to parse HTML.

I suggest taking a look at the HTML Agility Pack or CsQuery, both of which are purpose-built HTML parsers for .NET.

The HTML Agility Pack can be queried using XPath and LINQ, and CsQuery uses jQuery selectors.
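As a rough illustration of the XPath route, here is a minimal sketch that assumes the HtmlAgilityPack NuGet package is referenced. The class and method names (`XPathRows`, `Extract`) are made up for this example, and the XPath expressions are one plausible choice for the sample rows in the question, not the only way to write them.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathRows
{
    // Hypothetical helper: returns the trimmed text of the first four
    // cells of every <tr> found in the given HTML string.
    public static List<string[]> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<string[]>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var trNodes = doc.DocumentNode.SelectNodes("//tr");
        if (trNodes == null) return rows;

        foreach (var tr in trNodes)
        {
            // position() <= 4 leaves out the fifth cell, which is not needed.
            var cells = tr.SelectNodes("./td[position() <= 4]");
            if (cells == null || cells.Count < 4) continue;

            var values = new string[4];
            for (int i = 0; i < 4; i++)
            {
                // InnerText of the first cell includes the link text plus
                // surrounding newlines, so decode entities and trim.
                values[i] = HtmlEntity.DeEntitize(cells[i].InnerText).Trim();
            }
            rows.Add(values);
        }
        return rows;
    }
}
```

Calling `XPathRows.Extract(html)` on the sample rows should give one four-element array per row, starting with the link text.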

carla
Oded
  • Good answer. I'd just like to vouch for CsQuery, it is a lot more modern, a lot faster, and a lot nicer than HtmlAgilityPack. – Benjamin Gruenbaum Jan 20 '13 at 18:41
  • What's broke? I do understand the futility of my 'parsing HTML with regex' but as a curious exercise, how can I do so? –  Jan 20 '13 at 18:45
  • @AnubhavSaini - The futility of parsing HTML with regex turns a lot of people off from even attempting it. So I'm not sure you're going to get an answer. The thing is that even if you get it to work, it's always going to be a fragile solution. – Steve Wortham Jan 20 '13 at 19:29
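For the curiosity exercise raised in the comments, a few things in the question's patterns can be pinned down from the sample alone. In .NET, `.` does not match a newline unless `RegexOptions.Singleline` is set, so `<td>(.*)</td>` cannot match the first cell, which spans several lines. `<\w+>` requires the tag name to be followed directly by `>`, so it never matches `<a href="somelink">`. The original pattern also allows no whitespace between tags, its first character class excludes `.`, and its third group demands a decimal point that `57234` does not have. The sketch below works around those points for the sample rows only; the names `RegexRows` and `Extract` are invented for this example, and it remains as fragile as the comment above warns.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexRows
{
    // Singleline lets '.' match newlines; \s* tolerates whitespace between tags.
    // The lazy (.*?) groups stop at the first closing tag that follows.
    private static readonly Regex RowPattern = new Regex(
        @"<tr>\s*<td>\s*<a[^>]*>\s*(.*?)\s*</a>\s*</td>" +
        @"\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*<td>(.*?)</td>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Hypothetical helper: one four-element array per matched row.
    public static List<string[]> Extract(string html)
    {
        var rows = new List<string[]>();
        foreach (Match m in RowPattern.Matches(html))
        {
            rows.Add(new[]
            {
                m.Groups[1].Value, // link text
                m.Groups[2].Value, // second cell
                m.Groups[3].Value, // third cell
                m.Groups[4].Value  // fourth cell
            });
        }
        return rows;
    }
}
```

Any change in the markup (attributes on `td`, nested tags, a missing link) will make this pattern skip rows or capture the wrong text, which is the practical argument for a parser.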
1

If you used a real HTML parser, your code would be simpler and easier to maintain:

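using System.Linq;
using HtmlAgilityPack;

// Parse the HTML string held in the variable html.
var doc = new HtmlDocument();
doc.LoadHtml(html);

// One list per <tr>, holding the inner text of each <td> in that row.
var table = doc.DocumentNode.Descendants("tr")
           .Select(tr => tr.Descendants("td").Select(td => td.InnerText).ToList())
           .ToList();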

Given the sample HTML you provided, the code above returns 3 rows, each containing 5 columns.
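Since the question only needs the first four cells, the same LINQ query can be narrowed a little. This is a minimal sketch that assumes the HtmlAgilityPack package is referenced; the `FirstFourCells` class and its `Extract` method are names invented for the example.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class FirstFourCells
{
    // Hypothetical helper: trimmed text of the first four <td> cells per <tr>.
    public static List<List<string>> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants("tr")
            .Select(tr => tr.Descendants("td")
                .Take(4)                           // drop the fifth cell
                .Select(td => td.InnerText.Trim()) // strip newlines around the link text
                .ToList())
            .Where(cells => cells.Count == 4)      // skip rows with fewer cells
            .ToList();
    }
}
```

`Descendants("td")` also picks up cells of any nested table; if that matters for the real page, `Elements("td")` limits the search to direct children of each row.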

I4V
  • Didn't work. I think `HtmlWeb hweb = new HtmlWeb(); doc = hweb.Load(url, "get");` is what does it. –  Mar 14 '13 at 06:59