C# crawling rule doesn't work on the CNN website

Question

I'm a beginner at web crawling with C#.

I tried to crawl the CNN headline news from https://edition.cnn.com/

But I failed to get the headline texts.

The target looks like the HTML below (sorry, I'm not good at asking questions that contain source code, newbie T.T)

<div class="cd__wrapper" data-analytics="_list-hierarchical-xs_article_">
<div class="cd__content">
<h3 class="cd__headline" data-analytics="_list-hierarchical-xs_article_">
<a href="/travel/article/cruise-ship-passengers-stranded-coronavirus/index.html">
<span class="cd__headline-text">At least 30 cruise ships are at sea. Here's what it's like on board.</span><span class="cd__headline-icon cnn-icon"></span></a></h3></div></div>

First I tried to download the whole HTML as a string and parse it (my goal is to get each headline text together with its href link, so I can crawl the child pages)

with the C# code below:

public async void GetCnnAsync()
    {
        var url = "https://edition.cnn.com/";

        var httpClient = new HttpClient();
        var html = await httpClient.GetStringAsync(url);

        var htmlDocument = new Hp.HtmlDocument();
        htmlDocument.LoadHtml(html);

        var headLineHtmlList = htmlDocument.DocumentNode.Descendants("div")
            .Where(node => node.GetAttributeValue("class", "")
            .Contains("cd__headline")).ToList();


        Console.ReadLine();
    }

But it didn't work: headLineHtmlList contains no nodes. I don't know why it fails, because the Chrome page inspector shows those elements in the page source.

On the other hand, when I tried the same approach on the Stack Overflow site, I was able to get the question list with the code below:

public async void GetHtmlAsync()
    {
        var url = "https://stackoverflow.com/questions";

        var httpClient = new HttpClient();
        var html = await httpClient.GetStringAsync(url);

        var htmlDocument = new Hp.HtmlDocument();
        htmlDocument.LoadHtml(html);

        var questionsHtml = htmlDocument.DocumentNode.Descendants("div")
            .Where(node => node.GetAttributeValue("id", "")
            .Equals("questions")).ToList();

        var questionList = questionsHtml[0].Descendants("div")
            .Where(node => node.GetAttributeValue("id", "")
            .Contains("question-summary")).ToList();
    }

This returned the question list.

Now I really want to get the same kind of result from the CNN website. Please help me. Thanks in advance.

Update: more test code

I created a WebBrowser control, navigated to the page, and parsed the document in the WebBrowser_DocumentCompleted callback,

but I still didn't get a result.

So I tried again with DocumentCompleted, but I still didn't get it:

    WebBrowser webBrowser;
    Control parent;
    WebNewsCallback newsCallback;

    public WebNewsCrawler(Control parent, WebNewsCallback newsCallback) {
        this.parent = parent;
        this.newsCallback = newsCallback;
        if (webBrowser == null) {
            webBrowser = new WebBrowser {
                Visible = false,
                ScriptErrorsSuppressed = true
            };
        }
        parent.Controls.Add(webBrowser);
        webBrowser.DocumentCompleted += WebBrowser_DocumentCompleted;
    }

    public void doWork(string address) {
        webBrowser.Navigate(address);
    }

    int count = 0;

    private void WebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e) {
        if (webBrowser.ReadyState != WebBrowserReadyState.Complete) return;
        newsCallback(webBrowser.DocumentStream);
        GetCnn(webBrowser.DocumentStream);
        Console.WriteLine(count.ToString());
        count++;
    }

    public void GetCnn(Stream stream) {
        var doc = new Hp.HtmlDocument();
        doc.Load(stream, Encoding.UTF8);

        var nodes = doc.DocumentNode.SelectNodes("/html/body/div[7]/section[2]/div[2]/div/div[1]/ul/li[4]/article/div/div/h3/a/span[1]");
        if(nodes != null) {
            Console.WriteLine("xpath nodes not null");
        }

        var headLineHtmlList = doc.DocumentNode.Descendants("h3").ToList();                
        if (headLineHtmlList != null) {
            Console.WriteLine("headLineCount " +headLineHtmlList.Count.ToString());
        }
    }

headLineCount is 0 and the XPath query returns no nodes (a relative XPath and the full XPath give the same result).

Because CNN uses JavaScript rendering, probably React or similar. — Ian Kemp, Mar 21 '20 at 13:47
... so you need a headless webbrowser. `HttpClient()` cannot fetch complete dynamic pages. — Jimi, Mar 21 '20 at 13:55
@Jimi so I tried it again with headless webbrowser if (webBrowser == null) { webBrowser = new WebBrowser { Visible = false, ScriptErrorsSuppressed = true }; } parent.Controls.Add(webBrowser); webBrowser.DocumentCompleted += WebBrowser_DocumentCompleted; but it failed to get result again. — Seonghyun Kim, Mar 21 '20 at 15:51
CompletedEventArgs e) { ThreadPool.QueueUserWorkItem(new WaitCallback(myAsyncOperation)); } void myAsyncOperation(Object state) { Thread.Sleep(5000); if (webBrowser.InvokeRequired) { webBrowser.Invoke(new Action(delegate () { newsCallback(webBrowser.DocumentStream); GetCnn(webBrowser.DocumentStream); })); } } but I didnt get result T.T — Seonghyun Kim, Mar 21 '20 at 15:53
Remove all threading related stuff. Plus, you don't need to show the control in a UI: you can initialize a **headless** (no UI) WebBrowser class. In the `DocumentCompleted` handler, you check `if ([WebBrowser].ReadyState != WebBrowserReadyState.Complete) return;` etc. Read the notes [here](https://stackoverflow.com/a/60741246/7444103). See the section related to the HtmlDocument's Frames/IFrames. + Many examples around — Jimi, Mar 21 '20 at 16:10
@Jimi Great thanks I will try it with https://stackoverflow.com/questions/53213782/how-to-get-an-htmlelement-value-inside-frames-iframes/53218064#53218064 — Seonghyun Kim, Mar 21 '20 at 17:05
@Jimi I tried it with `if ([WebBrowser].ReadyState != WebBrowserReadyState.Complete) return;` but I got the same result: the XPath lookup fails and the headline count is zero. Hmm, I have attached my full code to the question body above. — Seonghyun Kim, Mar 22 '20 at 01:06
Have you read the part where it's stated that *the DocumentCompleted event can and will be raised multiple times*? What is that `WebNewsCallback` doing there? Then, don't use the DocumentStream, use the Document, check whether the HtmlElement is there; if it's not then the Document part you're looking for is not ready yet. Do parse it only when you find the HtmlElement inside it. You can use the common methods (`[WebBrowser].Document.GetElementById()`, `GetElementsByTagName()` etc.) to test for that. Then, eventually, use your parser (HtmlAgilityPack?) when the test is positive. — Jimi, Mar 22 '20 at 02:04
You have to test the Document of each IFrame. Read the notes I've already posted. It's all there. — Jimi, Mar 22 '20 at 02:07
Yes, I did. I got multiple DocumentCompleted events and then tested the results. OK, I will test it with the Document. Thanks — Seonghyun Kim, Mar 22 '20 at 06:49
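
The advice in the comments above (expect `DocumentCompleted` to fire more than once, test the live `Document` for the element, and only then hand the markup to a parser) might be sketched roughly as follows. This is a minimal sketch, not a confirmed fix: the class name `RenderedPageWatcher`, the `onReady` callback and the `HasClass` helper are hypothetical names introduced for illustration. It does not walk the frames/IFrames that the comments mention, and it assumes the WinForms `WebBrowser` control is able to run CNN's scripts at all, which is not guaranteed.

```csharp
using System;
using System.Linq;
using System.Windows.Forms;

// Hypothetical helper: waits until an <h3> carrying the headline class
// exists in the live document, then passes the rendered HTML on once.
public sealed class RenderedPageWatcher
{
    private readonly WebBrowser browser;
    private readonly Action<string> onReady; // hypothetical callback: receives the rendered HTML
    private bool delivered;

    public RenderedPageWatcher(WebBrowser browser, Action<string> onReady)
    {
        this.browser = browser;
        this.onReady = onReady;
        browser.DocumentCompleted += OnDocumentCompleted;
    }

    // True when the space-separated class list contains exactly this token.
    public static bool HasClass(string classList, string token)
    {
        return (classList ?? "")
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Contains(token);
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The event may be raised several times, so repeat the test on every call.
        if (delivered || browser.ReadyState != WebBrowserReadyState.Complete) return;

        HtmlDocument doc = browser.Document; // System.Windows.Forms.HtmlDocument
        if (doc == null || doc.Body == null) return;

        // The WebBrowser DOM exposes the class attribute as "className".
        bool found = doc.GetElementsByTagName("h3")
            .Cast<HtmlElement>()
            .Any(h3 => HasClass(h3.GetAttribute("className"), "cd__headline"));
        if (!found) return; // the part we want is not in the document yet

        delivered = true;
        onReady(doc.Body.OuterHtml); // hand the rendered markup to the parser
    }
}
```

If the headlines are injected only after the last `DocumentCompleted` event, this handler never sees them. In that case the same test would have to be repeated on a UI-thread timer, or a different browser engine would be needed.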

score 0 · Answer 1 · answered Apr 09 '20 at 19:02

Are you sure that the selector you're using is correct? You said:

    var headLineHtmlList = htmlDocument.DocumentNode.Descendants("div")
        .Where(node => node.GetAttributeValue("class", "")
        .Contains("cd__headline")).ToList();

Isn't that saying "Give me all the descendants with tag <div> and a CSS class of cd__headline"?

But you're not looking for a div with a class of cd__headline. You're looking for <a> tags occurring inside <h3> tags that have a CSS class of cd__headline.

I could be wrong, but if I'm right it would be an easy fix! Good luck.
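
As a rough illustration of that selector change, the sketch below looks for `<a>` elements inside `<h3>` elements whose class list includes `cd__headline`, and resolves each `href` against the site root so the child pages can be crawled later. It assumes the HtmlAgilityPack NuGet package is installed; `HeadlineParser` and `Parse` are hypothetical names, and the sample markup is copied from the question. It can only find headlines that are present in the HTML string it is given, so on its own it does not address the JavaScript-rendering issue raised in the comments.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package (assumed installed); aliased as "Hp" in the question

public static class HeadlineParser
{
    // Returns (text, absolute link) pairs for every <a> with an href found
    // inside an <h3> whose class list contains the token "cd__headline".
    public static List<(string Text, Uri Link)> Parse(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants("h3")
            .Where(h3 => HasClass(h3, "cd__headline"))
            .SelectMany(h3 => h3.Descendants("a"))
            .Where(a => a.GetAttributeValue("href", "").Length > 0)
            .Select(a => (
                Text: HtmlEntity.DeEntitize(a.InnerText).Trim(),
                // Relative hrefs such as "/travel/..." are resolved against baseUri.
                Link: new Uri(baseUri, a.GetAttributeValue("href", ""))))
            .ToList();
    }

    // Compares whole class tokens, so "cd__headline-text" does not match.
    private static bool HasClass(HtmlNode node, string token) =>
        node.GetAttributeValue("class", "")
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Contains(token);
}

public static class Program
{
    public static void Main()
    {
        // Sample markup taken from the question.
        const string html =
            "<div class=\"cd__content\"><h3 class=\"cd__headline\">" +
            "<a href=\"/travel/article/cruise-ship-passengers-stranded-coronavirus/index.html\">" +
            "<span class=\"cd__headline-text\">At least 30 cruise ships are at sea. " +
            "Here's what it's like on board.</span>" +
            "<span class=\"cd__headline-icon cnn-icon\"></span></a></h3></div>";

        foreach (var (text, link) in HeadlineParser.Parse(html, new Uri("https://edition.cnn.com/")))
            Console.WriteLine(text + " -> " + link.AbsoluteUri);
    }
}
```

Matching on whole class tokens instead of `Contains("cd__headline")` avoids also picking up elements such as the `cd__headline-text` span.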
