-1

I am experimenting with web scraping and I am having trouble scraping a particular value out of some nested div classes. I am using the .NET HtmlAgilityPack class library in a .NET Framework C# Console App. Here is the div code:

<div class="ds-nearby-schools-list">
    <div class="ds-school-row">
        <div class="ds-school-rating">
            <div class="ds-gs-rating-8">
                <span class="ds-hero-headline ds-schools-display-rating">8</span>
                <span class="ds-rating-denominator ds-legal">/10</span>
            </div>
        </div>
        <div class="ds-nearby-schools-info-section">
            <a class="ds-school-name ds-standard-label notranslate" href="https://www.greatschools.org/school?id=00870&amp;state=MD" rel="nofollow noopener noreferrer" target="_blank">Candlewood Elementary School</a>
            <ul class="ds-school-info-section">
                <li class="ds-school-info">
                    <span class="ds-school-key ds-body-small">Grades:</span>
                    <span class="ds-school-value ds-body-small">K-5</span>
                </li>
                <li class="ds-school-info">
                    <span class="ds-school-key ds-body-small">Distance:</span>
                    <span class="ds-school-value ds-body-small">0.8 mi</span>
                </li>
            </ul>
        </div>
    </div>
</div>

I want to scrape the "8" from the span whose classes are ds-hero-headline and ds-schools-display-rating. I am having trouble formulating the selector for the SelectNodes method on the DocumentNode property of the HtmlDocument class.

Ari
Michael

3 Answers

0

It sounds like the difficulty is writing an XPath expression that selects the node. Try //*[contains(@class, 'ds-hero-headline') and contains(@class, 'ds-schools-display-rating')] with the SelectNodes method.

However, contains() does a substring match, so this XPath will also match if the page you're targeting has a class name such as ds-hero-headline-content, which contains ds-hero-headline. In that case, see the solution in How can I find an element by CSS class with XPath?
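The substring problem can be avoided by padding the class attribute with spaces and matching a whole token. The following is a minimal sketch of that idea, assuming the HtmlAgilityPack NuGet package is referenced; the ClassToken and RatingXPath helpers and the sample markup are made up for illustration and are not part of the library.

```csharp
using System;
using HtmlAgilityPack; // NuGet package already used in the question

public static class ClassTokenXPath
{
    // Hypothetical helper: builds a predicate that matches a whole class token,
    // so "ds-hero-headline" does not match "ds-hero-headline-content".
    public static string ClassToken(string name)
    {
        return "contains(concat(' ', normalize-space(@class), ' '), ' " + name + " ')";
    }

    // Hypothetical helper: XPath for the rating span from the question.
    public static string RatingXPath()
    {
        return "//span[" + ClassToken("ds-hero-headline") + " and "
                         + ClassToken("ds-schools-display-rating") + "]";
    }

    public static void Main()
    {
        // Illustrative markup: the second span only shares a prefix with the wanted class.
        const string html =
            "<div><span class=\"ds-hero-headline ds-schools-display-rating\">8</span>" +
            "<span class=\"ds-hero-headline-content ds-schools-display-rating\">x</span></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(RatingXPath());
        if (nodes == null)
        {
            Console.WriteLine("no match");
            return;
        }

        foreach (HtmlNode node in nodes)
        {
            Console.WriteLine(node.InnerText.Trim());
        }
    }
}
```

With the sample markup above, only the first span should be selected, because the padded attribute of the second span never contains the token " ds-hero-headline ".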

Yas Ikeda
0

I would use this XPath to extract 0.8 mi:

//div[@class='ds-nearby-schools-list']/div[@class='ds-school-row']/div[@class='ds-nearby-schools-info-section']/ul[@class='ds-school-info-section']/li[@class='ds-school-info']/span[@class='ds-school-value ds-body-small' and preceding-sibling::span[@class='ds-school-key ds-body-small' and text()='Distance:']]/text()

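Then apply this regex to the extracted text. As written, its single capture group holds the unit (mi); to capture the number as well, wrap the numeric part in parentheses too, as in ^([0-9.]+) (.*)$: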

^[0-9\.]+ (.*)$

Finally, convert the numeric part to a number (for example with double.Parse) and store the distance and its unit in an object.
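As a rough sketch of the regex and conversion steps, the snippet below uses a variant of the pattern with two capture groups, one for the number and one for the unit. The DistanceParser class and its TryParse method are illustrative names, and parsing with the invariant culture is an assumption that the site always uses a dot as the decimal separator.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class DistanceParser
{
    // Group 1 is the number, group 2 is the unit (variant of the regex above).
    private static readonly Regex Pattern = new Regex(@"^([0-9.]+) (.*)$");

    // Returns false when the text is not in "<number> <unit>" form.
    public static bool TryParse(string text, out double value, out string unit)
    {
        value = 0;
        unit = "";

        Match m = Pattern.Match(text.Trim());
        if (!m.Success)
        {
            return false;
        }

        unit = m.Groups[2].Value;
        // Invariant culture so "0.8" parses the same way on every machine.
        return double.TryParse(m.Groups[1].Value, NumberStyles.Float,
                               CultureInfo.InvariantCulture, out value);
    }

    public static void Main()
    {
        double value;
        string unit;
        if (TryParse("0.8 mi", out value, out unit))
        {
            Console.WriteLine(value.ToString(CultureInfo.InvariantCulture) + " " + unit);
        }
    }
}
```

Keeping the number and the unit separate makes it straightforward to store them in whatever object represents a school.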

dafie
0

Have you tried the following to get the 8? You can search for the specific span element by its class attribute and read its inner text.

Note: I loaded the HTML from your question from a text file.

    // Requires: using System; using System.IO; using HtmlAgilityPack;
    string htmlFile = File.ReadAllText(@"TempFile.html");
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(htmlFile);
    HtmlNode htmlDoc = doc.DocumentNode;

    HtmlNode node = htmlDoc.SelectSingleNode("//span[@class='ds-hero-headline ds-schools-display-rating']");
    Console.WriteLine(node.InnerText);

    // output: 8

Alternate: Another way is to specify the path that you want the value from, starting from the div element.

    HtmlNode node2 = htmlDoc.SelectSingleNode("//div[@class='ds-gs-rating-8']//span[@class='ds-hero-headline ds-schools-display-rating']");
    Console.WriteLine(node2.InnerText);

output

8
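If the page lists several schools, the same path-based idea can be applied per row. The sketch below is one possible way to do it, assuming each school sits in its own ds-school-row div as in the question; the SchoolRatings class, its Extract method and the shortened sample markup are illustrative only. Note the leading "." in the inner XPath expressions: without it, "//" searches from the document root rather than from the current row.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package already used in the question

public static class SchoolRatings
{
    // Returns (school name, rating) pairs, one per row that has both values.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var results = new List<KeyValuePair<string, string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when no row matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//div[@class='ds-school-row']");
        if (rows == null)
        {
            return results;
        }

        foreach (HtmlNode row in rows)
        {
            // ".//" keeps the search inside this row.
            HtmlNode rating = row.SelectSingleNode(
                ".//span[@class='ds-hero-headline ds-schools-display-rating']");
            HtmlNode name = row.SelectSingleNode(".//a[contains(@class, 'ds-school-name')]");
            if (rating == null || name == null)
            {
                continue; // skip rows that lack a rating or a name
            }

            results.Add(new KeyValuePair<string, string>(
                HtmlEntity.DeEntitize(name.InnerText).Trim(),
                rating.InnerText.Trim()));
        }

        return results;
    }

    public static void Main()
    {
        // Shortened version of the markup from the question.
        const string html =
            "<div class=\"ds-nearby-schools-list\"><div class=\"ds-school-row\">" +
            "<div class=\"ds-school-rating\"><div class=\"ds-gs-rating-8\">" +
            "<span class=\"ds-hero-headline ds-schools-display-rating\">8</span>" +
            "<span class=\"ds-rating-denominator ds-legal\">/10</span></div></div>" +
            "<div class=\"ds-nearby-schools-info-section\">" +
            "<a class=\"ds-school-name ds-standard-label notranslate\">Candlewood Elementary School</a>" +
            "</div></div></div>";

        foreach (KeyValuePair<string, string> pair in Extract(html))
        {
            Console.WriteLine(pair.Key + ": " + pair.Value);
        }
    }
}
```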
Jawad