4

So I'm writing an application that will do a little screen scraping. I'm using the HTML Agility Pack to load an entire HTML page into an instance of HtmlDocument called doc. Now I want to parse that doc, looking for this:

<table border="0" cellspacing="3">
<tr><td>First rows stuff</td></tr>
<tr>
<td> 
The data I want is in here <br /> 
and it's separated by these annoying <br /> 's.

No id's, classes, or even a single <p> tag. </p> Just a bunch of <br />  tags.
</td> 
</tr> 
</table> 

So I just need to get the data within the 2nd row. How can I do this? Should I use a regex or something else?

Update: Here is how I'm loading my doc

HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(Url);
Widor
Bob Dylan
  • Is there only one table in your document? If not, how will you locate the table you are interested in? – Mark Byers Jun 12 '10 at 05:38
  • @Mark: Based on the `cellspacing="3"` attribute. I understand this sounds *hacky* (and that's because it is), but no other table in the 1000+ documents contains a cellspacing attribute at all. This isn't production code, just a project I'm running to collect some data. – Bob Dylan Jun 12 '10 at 05:43
  • Your title and question disagree. Title: `How can I get all content within tags` Question: `So I just need to get the data within the 2nd row.` Which is it? Can you fix it so that the title and question match? – Mark Byers Jun 12 '10 at 05:52

5 Answers

3

Since you are using the HTML Agility Pack already, I would suggest using the methods it provides to find the information you want. There are a few ways to navigate the document, but one of the most concise is XPath. In this case you could use something like this:

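// Requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load("input.html");
// Select the cell in the second row of the table that has cellspacing="3".
HtmlNode node = doc.DocumentNode
                   .SelectNodes("//table[@cellspacing='3']/tr[2]/td")
                   .Single();
// InnerText drops the markup (including the <br /> tags) and keeps only the text.
string text = node.InnerText;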
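
If the XPath matches nothing, the lookup yields null instead of an empty result, and indexing or dereferencing it throws "Object reference not set to an instance of an object". Below is a minimal sketch of a null-safe variant. The class and method names (`SecondRowReader`, `ReadSecondRow`) are hypothetical, and it assumes the page is already available as a string and that the rows are direct children of the table, as in the sample markup:

```csharp
using System;
using HtmlAgilityPack;

static class SecondRowReader
{
    // Hypothetical helper: returns the text of the cell in the second row of the
    // table marked with cellspacing="3", or null when the XPath matches nothing.
    public static string ReadSecondRow(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null (it does not throw) when there is no match,
        // so check the result before touching InnerText.
        HtmlNode cell = doc.DocumentNode
                           .SelectSingleNode("//table[@cellspacing='3']/tr[2]/td");
        return cell == null ? null : cell.InnerText.Trim();
    }
}
```

A null result then tells you the expression did not match. You can shorten the XPath one step at a time (drop `/td`, then `/tr[2]`) to see which part fails.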
Mark Byers
  • I think you're on the right track, but I'm not seeing the `.Single()` method in intellisense. I'm using version 1.4.0 of the HTML Agility Pack. – Bob Dylan Jun 12 '10 at 05:56
  • Add a reference to System.Core and add `using System.Linq;`. – alexn Jun 12 '10 at 05:57
  • @Bob Dylan: That code was just an example. You don't *have* to use `Single()` if you don't have it available - you could just write `.SelectNodes(...)[0]` instead. Though knowing about Linq would be a huge asset for developing in C#. – Mark Byers Jun 12 '10 at 06:02
  • @Mark: Ok I just tried using the `[0]` like you said and got an exception: `node`: "Object reference not set to an instance of an object". I assume this means it didn't find the table, tr, or the td? – Bob Dylan Jun 12 '10 at 06:04
  • @Bob Dylan: Correct. You could change the XPath expression to "//table[@cellspacing=3]" and see if that matches. – Mark Byers Jun 12 '10 at 06:07
  • @Mark: I tried that and it gave me the same error. Also, I've updated my question to show how I'm loading the document (just in case that makes a difference). – Bob Dylan Jun 12 '10 at 06:14
1

"Something else" is the best answer -- HTML is best parsed by an HTML parser rather than via regular expressions. I'm no C# expert, but I hear the HTML Agility Pack is well-liked for this purpose.

Alex Martelli
1

I'd say som̡et̨hińg Else

Community
FelipeAls
  • Normally I would agree with that too, but I think this is an exception because I'm looking for something so narrow. However, if you could **actually suggest something else** I would be open to that too. – Bob Dylan Jun 12 '10 at 05:36
0

You'd probably get better mileage with an XML parser.

Josh Sterling
0

If you're using the Agility Pack already, then it's just a matter of using something like doc.DocumentNode.SelectNodes("//table[@cellspacing='3']") to get the table in the document. Try looking through the documentation and coding examples. Since you already have structured data, it's ridiculous to go back to the text data and reparse.
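
Once you have the cell as a node, the pieces between the `<br />` tags are simply its direct text children, so no regex is needed. Here is a rough sketch of that idea. The names `BrSplitter` and `SplitOnBreaks` are made up for illustration, and it assumes the text you want sits directly inside the `<td>` (text nested in other elements, such as a `<p>`, is skipped):

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class BrSplitter
{
    // Hypothetical helper: returns each non-empty run of text that sits directly
    // inside the cell, i.e. the pieces separated by the <br /> tags.
    public static List<string> SplitOnBreaks(HtmlNode cell)
    {
        var parts = new List<string>();
        foreach (HtmlNode child in cell.ChildNodes)
        {
            // Skip element nodes such as <br />; keep only text nodes.
            if (child.NodeType != HtmlNodeType.Text) continue;

            // Decode entities like &amp; and trim the surrounding whitespace.
            string text = HtmlEntity.DeEntitize(child.InnerText).Trim();
            if (text.Length > 0) parts.Add(text);
        }
        return parts;
    }
}
```

You would pass in the node returned by the XPath lookup for the second row's cell and get back one string per line of data.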

Eclipse