
So I'm trying to scrape the HTML of a website.

private static async Task GetResponseMessageAsync(string filledInUrl, List<HttpResponseMessage> responseMessages)
{
    Console.WriteLine("Started GetResponseMessageAsync for url " + filledInUrl);
    var httpResponseMessage = await _httpClient.GetAsync(filledInUrl);
    await EnsureSuccessStatusCode(httpResponseMessage);
    responseMessages.Add(httpResponseMessage);
    Console.WriteLine("Finished GetResponseMessageAsync for url " + filledInUrl);
}

The URL I pass in is this: https://www.rtlnieuws.nl/zoeken?q=philips+fraude

When I right-click -> Inspect on that page in the browser, I see this: [screenshot: the rendered DOM in the browser's dev tools]

Normal HTML that I can search through with XPath.

BUT. When I actually print out what my ResponseMessage contains...

    var htmlDocument = new HtmlDocument(); // will hold the search results for a given keyword so I can query them as nodes

    var scrapedHtml = await httpResponseMessage.Content.ReadAsStringAsync();
    Console.WriteLine(scrapedHtml); 

... it looks like this: [screenshot: the raw HTML returned by the server]

It's different HTML. Basically, the HTML that the server sends and the HTML I see in the browser are not the same, so my XPaths no longer match anything in the response.

I know that my scraper generally works because when I used it on another website where the "server-HTML" and "browser-HTML" were the same it worked.

I wonder what I could do to translate the "server-HTML" into "browser-HTML"? How does that work? Is there something in the Html Agility Pack I could use? I couldn't find anything online, probably because "server-HTML" and "browser-HTML" are not the correct terms.

Will be grateful for your help.

miatochai
  • I'm not sure. I think the scraper does what it should. It loads the HTML response. Which is this: view-source:https://www.rtlnieuws.nl/zoeken?q=philips+fraude . BUT I want to see what is interpreted in the browser, which is this https://www.rtlnieuws.nl/zoeken?q=philips+fraude and right-click + inspect. I think the source HTML or whatever it's called is interpreted during a session in the browser, but I'm not sure how to imitate it. – miatochai Aug 08 '22 at 14:11
  • That page generates its client-side markup dynamically using JavaScript, meaning that you need a fully-fledged rendering engine to get the markup as you'd download it in a browser. `HttpClient` won't cut it; you'd need something like Selenium and/or WebView2. – Jeroen Mostert Aug 08 '22 at 14:12
  • Yeah, I had a feeling it was about rendering JS. So HtmlAgilityPack doesn't have anything for it? :( Too bad. – miatochai Aug 08 '22 at 14:18
  • If you want to observe what @JeroenMostert is describing, try disabling JavaScript in your browser and visiting the cited page. See if it has the same elements as the `HttpClient` response. If it does not, your request might be getting identified as a bot, and you need to arrange your request so it looks like an actual browser request (headers are crucial here). If it is the same, you need to use something else like Selenium, Puppeteer, Playwright, etc. – Eldar Aug 08 '22 at 14:18
  • HTMLAgilityPack parses HTML and that's all it does (though it does it quite well). The issue of how to get that HTML is a separate problem and not HAP's concern. – Jeroen Mostert Aug 08 '22 at 14:21
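Following the suggestion in the comments, a minimal sketch of how a real rendering engine could be combined with the Html Agility Pack: Selenium drives a headless browser, JavaScript runs, and the resulting DOM is handed to `HtmlDocument`. This assumes the `Selenium.WebDriver`, `Selenium.WebDriver.ChromeDriver`, and `HtmlAgilityPack` NuGet packages are installed; the `//article` XPath is a placeholder, not the actual selector for that site.

```csharp
using HtmlAgilityPack;
using OpenQA.Selenium.Chrome;

var options = new ChromeOptions();
options.AddArgument("--headless=new"); // run without a visible browser window

using var driver = new ChromeDriver(options);
driver.Navigate().GoToUrl("https://www.rtlnieuws.nl/zoeken?q=philips+fraude");

// PageSource now contains the DOM *after* JavaScript has run,
// i.e. the "browser-HTML" you see with right-click -> Inspect.
// If the results load asynchronously, you may need an explicit wait
// (e.g. WebDriverWait from Selenium.Support) before reading it.
var renderedHtml = driver.PageSource;

var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(renderedHtml);

// Placeholder XPath: use whatever selectors you built against the dev-tools markup.
var nodes = htmlDocument.DocumentNode.SelectNodes("//article");
```

The rest of the scraping pipeline (XPath queries, node processing) can stay unchanged; only the step that fetches the HTML is swapped out.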
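For the diagnostic Eldar describes, a sketch of sending browser-like headers with `HttpClient`. The header values here are illustrative, not copied from a real browser capture, and this only helps if the server is rejecting bot-like requests; it cannot make `HttpClient` execute JavaScript.

```csharp
using System.Net.Http;

var httpClient = new HttpClient();

// TryAddWithoutValidation avoids HttpClient's strict header parsing.
// These values are examples; capture your own browser's headers to mimic it exactly.
httpClient.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0 Safari/537.36");
httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept",
    "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Language", "nl-NL,nl;q=0.9,en;q=0.8");

var response = await httpClient.GetAsync("https://www.rtlnieuws.nl/zoeken?q=philips+fraude");
var html = await response.Content.ReadAsStringAsync();

// Compare this output with what the page shows when JavaScript is disabled
// in the browser: if they match, headers are not the problem and a rendering
// engine (Selenium etc.) is needed.
```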

0 Answers