I've been using HtmlAgilityPack for a while, but the site I'm working with now seems to run some jQuery-based step in the browser before the real content loads. I expect a product page to load, but what actually loads (verified with both a WebBrowser control and WebClient.DownloadString) is a redirect page asking the visitor to select a consultant and sign up with them.

In other words, using Chrome's Inspect >> Elements tool, I get:

<div data-v-1a7a6550="" class="product-extra-images">
  <img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/10174_1MainImage-White-9-14_1.jpg.100x100_q85_crop_upscale.jpg" width="50">
  <img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/10174_2Image2-White-9-14_1.jpg.100x100_q85_crop_upscale.jpg" width="50">

But WebBrowser and HtmlAgilityPack only get:

<div class="container content">
  <div class="alert alert-danger " role="alert">
    <button type="button" class="close" data-dismiss="alert">
      <span aria-hidden="true">&times;</span>
    </button>
    <h2 style="text-align: center; background: none; padding-bottom: 0;">It looks like you haven't selected a Consultant yet!</h2>
    <p style="text-align: center;"><span>...were you just wanting to browse or were you looking to shop and pick a Consultant to shop under?</span></p>
      <div class="text-center">
        <form action="/just-browsing/" method="POST" class="form-inline">
   ...

After digging into the class definitions in the head, I found that the page does use jQuery to handle loading and to respond to actions (scrolling, resizing, hovering over images, selecting other images, etc.) while the visitor browses the page. Here is the header comment from the jQuery file:

/*!
* jQuery JavaScript Library v2.1.4
* http://jquery.com/
*
* Includes Sizzle.js
* http://sizzlejs.com/
*
* Copyright 2005, 2014 jQuery Foundation, Inc. and other contributors
* Released under the MIT license
* http://jquery.org/license
*
* Date: 2015-04-28T16:01Z
*/

I tried ScrapySharp as described here: C# .NET: Scraping dynamic (JS) websites

But that just consumed all available memory and never produced anything.

I also tried this: htmlagilitypack and dynamic content issue. It loaded the incorrect redirect page noted above.

I can provide more of the source I'm trying to extract from, including the complete jQuery if needed.

Xero Phane
  • In theory, the latest versions of HtmlAgilityPack have a method called LoadFromBrowser that runs a browser in the background, so I don't know why you need ScrapySharp. What page are you trying to scrape? – derloopkat Oct 24 '18 at 14:24
  • Thank you for commenting! I believe I've tried LoadFromBrowser, but I'll do that now and let you know my results. ScrapySharp was suggested as an existing answer to my post last night: https://stackoverflow.com/questions/52959806/extract-image-sources-in-c-sharp-from-web-page-using-js?noredirect=1#comment92826729_52959806 but I've learned more about the problem since. Here's an example of what I am trying to scrape: https://paparazziaccessories.com/shop/products/mrs-harry-winston-3368/#/ – Xero Phane Oct 24 '18 at 14:36

1 Answer

Set CaptureRedirect = false; to bypass the redirection page. This worked for me with the page you mentioned:

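// The URL used in my test; note the trailing "/#/".
var url = "https://paparazziaccessories.com/shop/products/mrs-harry-winston-3368/#/";
var web = new HtmlWeb();
web.CaptureRedirect = false;
web.BrowserTimeout = TimeSpan.FromSeconds(15); // give up after 15 seconds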

Then keep checking until the text "Product Description" appears on the page:

var doc = web.LoadFromBrowser(url, html =>
{
    return html.Contains("Product Description");
});

Recent versions of HtmlAgilityPack can run a browser in the background, so we don't really need another library like ScrapySharp for scraping dynamic content.
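For reference, here is a minimal sketch that puts the two snippets above together and then pulls the image sources out of the "product-extra-images" div shown in the question. The class and method names (ProductImageScraper, ExtractImageSources) are made up for illustration, and the [STAThread] attribute is an assumption on my part: LoadFromBrowser is backed by a WebBrowser control, which normally wants a single-threaded apartment. Treat it as a starting point rather than a tested solution.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ProductImageScraper
{
    // Hypothetical helper: collects the src of every <img> inside the
    // "product-extra-images" div from the question's markup.
    public static List<string> ExtractImageSources(HtmlDocument doc)
    {
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(
            "//div[contains(@class, 'product-extra-images')]//img");
        if (nodes == null) return new List<string>();

        return nodes
            .Select(img => img.GetAttributeValue("src", string.Empty))
            .Where(src => src.Length > 0)
            .ToList();
    }

    // Assumption: the WebBrowser control behind LoadFromBrowser needs an STA thread.
    [STAThread]
    public static void Main()
    {
        // URL from the comments; the trailing "/#/" matters for this site.
        var url = "https://paparazziaccessories.com/shop/products/mrs-harry-winston-3368/#/";

        var web = new HtmlWeb();
        web.CaptureRedirect = false;
        web.BrowserTimeout = TimeSpan.FromSeconds(15);

        // Keep checking the rendered HTML until the product markup is present.
        var doc = web.LoadFromBrowser(url, html => html.Contains("Product Description"));

        foreach (var src in ExtractImageSources(doc))
            Console.WriteLine(src);
    }
}
```

If the server is slow, LoadFromBrowser throws the timeout exception instead of returning, so in practice you may want to wrap the call in a try/catch and retry.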

derloopkat
  • Awesome suggestion! Unfortunately, I tried it and it timed out. Increasing the timeout to 30 and then 0 (unlimited) still timed out or just kept running without producing a result. System.Exception: 'WebBrowser Execution Timeout Expired. The timeout period elapsed prior to completion of the operation. To avoid this error, increase the WebBrowserTimeout value or set it to 0 (unlimited).' – Xero Phane Oct 24 '18 at 15:44
  • Sometimes this server runs slowly or gets stuck. With a 15-second timeout, it worked 4 out of 7 times in my test. The url was "https://paparazziaccessories.com/shop/products/mrs-harry-winston-3368/#/" – derloopkat Oct 24 '18 at 15:49
  • Would you mind sharing the complete code that worked for you? Sorry I think my efforts are starting to inhibit my thought process haha – Xero Phane Oct 24 '18 at 15:55
  • Never mind about posting your code. I managed it. You're a complete genius! Thank you! – Xero Phane Oct 24 '18 at 16:02
  • The key to my issue (in case it's useful) was the trailing "/#/" at the end of the url. Without it, the process failed every time. – Xero Phane Oct 24 '18 at 16:10
  • Ahh, that makes sense: if you don't include that bit, the site redirects to an error page that doesn't contain "Product Description", so the call raises the timeout error. I just tried it in my internet browser without "/#/" and the page says "It looks like you haven't selected a Consultant yet!" – derloopkat Oct 24 '18 at 23:01
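
The point made in the last two comments can be sketched as a small helper that appends the trailing "/#/" when it is missing, so the URL handed to LoadFromBrowser always has it. The names ProductUrl and EnsureHashRoute are hypothetical, and the rule of leaving URLs that already carry some other "#" fragment untouched is an assumption, not something confirmed for this site.

```csharp
using System;

public static class ProductUrl
{
    // Hypothetical helper: returns the URL with a trailing "/#/",
    // adding it only when it is missing.
    public static string EnsureHashRoute(string url)
    {
        if (url == null) throw new ArgumentNullException(nameof(url));

        // Already in the form that worked in the comments above.
        if (url.EndsWith("/#/", StringComparison.Ordinal)) return url;

        // Assumption: a URL with a different fragment is left as-is.
        if (url.Contains("#")) return url;

        // Drop any trailing slashes so we do not end up with "//#/".
        return url.TrimEnd('/') + "/#/";
    }
}
```

For example, passing "https://paparazziaccessories.com/shop/products/mrs-harry-winston-3368" through EnsureHashRoute before calling LoadFromBrowser yields the same URL that was used in the test above.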