I'm a beginner at web crawling in C#.
I tried to crawl the headline news from CNN (https://edition.cnn.com/), but I failed to get the headline text.
The target HTML looks like this (sorry, I'm new and not good at asking questions that contain source code):
<div class="cd__wrapper" data-analytics="_list-hierarchical-xs_article_">
  <div class="cd__content">
    <h3 class="cd__headline" data-analytics="_list-hierarchical-xs_article_">
      <a href="/travel/article/cruise-ship-passengers-stranded-coronavirus/index.html">
        <span class="cd__headline-text">At least 30 cruise ships are at sea. Here's what it's like on board.</span>
        <span class="cd__headline-icon cnn-icon"></span>
      </a>
    </h3>
  </div>
</div>
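To make the goal concrete, here is a minimal sketch of the result I am after, meant to be run against a static snippet like the one above rather than the live page. It assumes the HtmlAgilityPack NuGet package is referenced under the alias `Hp`; the class name `HeadlineSketch` and the method `Extract` are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using Hp = HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package, aliased as Hp

public static class HeadlineSketch
{
    // Hypothetical helper: returns (text, href) pairs for every headline link in the given HTML.
    public static List<(string Text, string Href)> Extract(string html)
    {
        var doc = new Hp.HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<(string Text, string Href)>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//h3[contains(@class,'cd__headline')]/a");
        if (links == null) return results;

        foreach (var link in links)
        {
            // Only the text span is wanted, not the icon span next to it.
            var span = link.SelectSingleNode(".//span[contains(@class,'cd__headline-text')]");
            if (span == null) continue;

            var text = Hp.HtmlEntity.DeEntitize(span.InnerText).Trim();
            var href = link.GetAttributeValue("href", "");
            results.Add((text, href));
        }
        return results;
    }
}
```

Calling `HeadlineSketch.Extract(snippet)` on the HTML above should give one pair: the headline text and the relative link `/travel/article/cruise-ship-passengers-stranded-coronavirus/index.html`. This only shows the shape of the output I want; it does not explain why the live page gives me nothing.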
First I tried to download the whole page as a string and parse it (my goal is to get each headline's text together with its href link so that I can crawl the child pages), using the C# code below:
public async void GetCnnAsync()
{
    var url = "https://edition.cnn.com/";
    var httpClient = new HttpClient();
    // Download the page HTML as a string.
    var html = await httpClient.GetStringAsync(url);
    var htmlDocument = new Hp.HtmlDocument(); // Hp is an alias for HtmlAgilityPack
    htmlDocument.LoadHtml(html);
    // Look for div elements whose class attribute contains "cd__headline".
    var headLineHtmlList = htmlDocument.DocumentNode.Descendants("div")
        .Where(node => node.GetAttributeValue("class", "")
        .Contains("cd__headline")).ToList();
    Console.ReadLine();
}
But it didn't work: headLineHtmlList comes back with no matches, and I don't know why, because the page source shown in Chrome's inspector does contain those elements.
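One check I can think of is whether the markup is present in the raw response at all, since HttpClient only downloads the HTML and does not run any scripts the page may use to build its content. The following is a minimal sketch under that assumption; `RawHtmlCheck`, `CountOccurrences`, and `CountInResponseAsync` are hypothetical names, and only standard .NET calls are used.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class RawHtmlCheck
{
    // Hypothetical helper: counts non-overlapping occurrences of marker in text.
    public static int CountOccurrences(string text, string marker)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(marker)) return 0;

        int count = 0, index = 0;
        while ((index = text.IndexOf(marker, index, StringComparison.Ordinal)) >= 0)
        {
            count++;
            index += marker.Length; // continue searching after this match
        }
        return count;
    }

    // Downloads the page as HttpClient sees it (no JavaScript is executed)
    // and reports how often the marker appears in that raw HTML.
    public static async Task<int> CountInResponseAsync(string url, string marker)
    {
        using (var httpClient = new HttpClient())
        {
            var html = await httpClient.GetStringAsync(url);
            return CountOccurrences(html, marker);
        }
    }
}
```

For example, `await RawHtmlCheck.CountInResponseAsync("https://edition.cnn.com/", "cd__headline")` returning 0 would suggest the elements I see in the inspector are not part of the downloaded HTML, in which case no selector applied to that string could find them.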
On the other hand, when I tried the same approach on the Stack Overflow site, I was able to get the question list with the code below:
public async void GetHtmlAsync()
{
    var url = "https://stackoverflow.com/questions";
    var httpClient = new HttpClient();
    var html = await httpClient.GetStringAsync(url);
    var htmlDocument = new Hp.HtmlDocument();
    htmlDocument.LoadHtml(html);
    var questionsHtml = htmlDocument.DocumentNode.Descendants("div")
        .Where(node => node.GetAttributeValue("id", "")
        .Equals("questions")).ToList();
    var questionList = questionsHtml[0].Descendants("div")
        .Where(node => node.GetAttributeValue("id", "")
        .Contains("question-summary")).ToList();
}
That returned the question list as expected.
Now I really want to get the same kind of result from the CNN website. Please help. Thanks in advance.
Update: I added more test code.
I created a WebBrowser control, navigated to the page, and handled the WebBrowser_DocumentCompleted callback, but I still didn't get a result.
I then tried again, parsing the document inside DocumentCompleted, but that didn't work either.
WebBrowser webBrowser;
Control parent;
WebNewsCallback newsCallback;

public WebNewsCrawler(Control parent, WebNewsCallback newsCallback) {
    this.parent = parent;
    this.newsCallback = newsCallback;
    if (webBrowser == null) {
        webBrowser = new WebBrowser {
            Visible = false,
            ScriptErrorsSuppressed = true
        };
    }
    parent.Controls.Add(webBrowser);
    webBrowser.DocumentCompleted += WebBrowser_DocumentCompleted;
}

public void doWork(string address) {
    webBrowser.Navigate(address);
}

int count = 0;

private void WebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e) {
    if (webBrowser.ReadyState != WebBrowserReadyState.Complete) return;
    newsCallback(webBrowser.DocumentStream);
    GetCnn(webBrowser.DocumentStream);
    Console.WriteLine(count.ToString());
    count++;
}

public void GetCnn(Stream stream) {
    var doc = new Hp.HtmlDocument();
    doc.Load(stream, Encoding.UTF8);
    var nodes = doc.DocumentNode.SelectNodes("/html/body/div[7]/section[2]/div[2]/div/div[1]/ul/li[4]/article/div/div/h3/a/span[1]");
    if (nodes != null) {
        Console.WriteLine("xpath nodes not null");
    }
    var headLineHtmlList = doc.DocumentNode.Descendants("h3").ToList();
    if (headLineHtmlList != null) {
        Console.WriteLine("headLineCount " + headLineHtmlList.Count.ToString());
    }
}
headLineCount is 0 and the XPath query returns no nodes (I get the same result with both the short XPath and the full XPath).