
I have some HTML and want to scrape some data from it.

The HTML is structured in the following way:

<div class="someClass"><span class="someOtherClass">Text</span></div>

<table>
  <tbody>
    <tr>
      <td>label</td>
      <td>data</td>
    </tr>
    <tr>
      <td>label</td>
      <td>data</td>
    </tr>
    <tr>
      <td>label</td>
      <td>data</td>
    </tr>
  </tbody>
</table>

<div class="someClass"><span class="someOtherClass">Text</span></div>
<table>
  <tbody>
    <tr>
      <td>label</td>
      <td>data</td>
    </tr>
    <tr>
      <td>label</td>
      <td>data</td>
    </tr>
    <tr>
      <td>label</td>
      <td>data</td>
    </tr>
  </tbody>
</table>
<div class="someClass"><span class="someOtherClass">Text</span></div>

I need to be able to scrape the Text value located in the span where class="someOtherClass" (I've already implemented this portion).

I then need to be able to scrape the table directly below the div. Since the "parent" div doesn't actually contain the table, I'm having some issues implementing this.

Eitan Seri-Levi
  • Your HTML doesn't seem to be malformed. HtmlAgilityPack's HtmlDocument should be able to locate the structures you want to extract from its DOM. Have you tried that? – James Aug 17 '17 at 19:42
  • If you still want to use regex - please read all posts in https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/ first. – Alexei Levenkov Aug 17 '17 at 20:19
  • @EitanSeri-Levi - I edited your post to remove the _regex_ tag and the regex verbiage in the post's body. Please accept the edit. Realize though that some people only monitor certain tags and titles. Please try to be more careful in the future. And I do believe there are about a million duplicates of XPath posts. I will mark this as a duplicate when I have the time. Good luck to you !! –  Aug 17 '17 at 23:52

1 Answer


I need to be able to scrape the Text value located in the span

You don't need regex. An XPath query is enough.

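// doc is an HtmlAgilityPack.HtmlDocument already loaded with the page (see the loading code further down).
var text = doc.DocumentNode
            .SelectNodes("//span[@class='someOtherClass']") // returns null if nothing matches
            .Select(x => x.InnerText)
            .ToList();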

I then need to be able to scrape the table directly below the div.

Use a similar XPath query:

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlstring);

// following::table matches every table that appears after a matching span.
var tables = doc.DocumentNode
                .SelectNodes("//span[@class='someOtherClass']/following::table")
                .ToList();
foreach (var table in tables)
{
    // One inner list of cell texts per table row.
    var list = table.Descendants("tr")
                    .Select(tr => tr.Descendants("td")
                                    .Select(td => td.InnerText)
                                    .ToList())
                    .ToList();
}
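
If each heading needs to be paired with the table directly below it, one possible variation is to start from the div and use the following-sibling axis, because `following::table` returns all later tables without saying which heading each one belongs to. This is a minimal sketch, not part of the original answer: the `HeadingTableScraper` class and `Scrape` method are hypothetical names, and it assumes each div and its table are siblings, as in the question's markup.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HeadingTableScraper
{
    // Hypothetical helper: returns one (heading, rows) pair for every
    // div.someClass whose next element sibling is a table.
    public static List<(string Heading, List<List<string>> Rows)> Scrape(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<(string Heading, List<List<string>> Rows)>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var divs = doc.DocumentNode.SelectNodes("//div[@class='someClass']");
        if (divs == null) return result;

        foreach (var div in divs)
        {
            var span = div.SelectSingleNode(".//span[@class='someOtherClass']");

            // First following element sibling, kept only if it is a table.
            var table = div.SelectSingleNode("following-sibling::*[1][self::table]");
            if (span == null || table == null) continue;

            // One inner list of trimmed cell texts per table row.
            var rows = table.Descendants("tr")
                            .Select(tr => tr.Descendants("td")
                                            .Select(td => td.InnerText.Trim())
                                            .ToList())
                            .ToList();

            result.Add((span.InnerText.Trim(), rows));
        }

        return result;
    }
}
```

With the sample markup from the question, `HeadingTableScraper.Scrape(htmlstring)` should return two entries, since the last div has no table after it and is skipped.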
L.B
  • Handsome solution – Medet Tleukabiluly Aug 17 '17 at 20:35
  • @sln Don't worry. HtmlAgilityPack is very resilient in parsing malformed HTML :) – L.B Aug 17 '17 at 20:42
  • @sln It should be. I am sure you have already read this famous answer https://stackoverflow.com/a/1732454/932418 – L.B Aug 17 '17 at 20:45
  • @sln Nice, I hope no one needs a change in it. – L.B Aug 17 '17 at 20:52
  • This is just first level general tag parsing (this with invisible content). I've made a SAX parser using this. I also have hundreds of scraper mods to this to find specific data. It never falters on malformed html and is lightning quick. –  Aug 17 '17 at 21:05
  • @sln, I haven't compared the speed of regex and HtmlAgilityPack, but let's assume regex is faster. I can live with slower but more readable and maintainable code, as a poor soul who is not as good as you at regex. – EZI Aug 17 '17 at 22:03