UPDATE #2: Continuing this effort (see the original question and update #1 below). ScrapySharp looked promising, but no matter what I tried, the process consumed all available memory and produced no output. I also found that, because of jQuery on the site, the correct web page does not load in a test WebBrowser control. The site appears to run a function that checks how you arrived at the requested page, validates something about your browser, and redirects you to a generic sign-up page.
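One experiment for the gate described above is to send the request with browser-like headers and see whether the redirect still happens. This is only a sketch under an assumption: that the check depends on request headers such as Referer and User-Agent. Nothing observed so far confirms that, and a check done in script on the page would not be affected. The class name `GateProbe`, the method `BuildRequest`, and all header values are hypothetical placeholders.

```csharp
using System;
using System.Net.Http;

public static class GateProbe
{
    // Builds a GET request carrying browser-like headers.
    // Assumption: the site inspects Referer and User-Agent; this is not confirmed.
    public static HttpRequestMessage BuildRequest(Uri page, Uri referrer, string userAgent)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, page);
        request.Headers.Referrer = referrer;            // the page we claim to have come from
        request.Headers.UserAgent.ParseAdd(userAgent);  // the browser we claim to be
        return request;
    }
}
```

The request could then be passed to `HttpClient.SendAsync` and the final `response.RequestMessage.RequestUri` compared with the requested address to see whether the sign-up redirect was triggered.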

Any thoughts on how to appease the gatekeeper?

UPDATE #1 (details of the original, non-unique question are below): First of all, THANK YOU for the suggestions!

I tried ScrapySharp, as @John suggested, like this:

    ScrapingBrowser Browser = new ScrapingBrowser();
    WebPage PageResult = Browser.NavigateToPage(new Uri("http://www.example-site.com"));
    HtmlNode rawHTML = PageResult.Html;
    Console.WriteLine(rawHTML.InnerHtml);
    Console.ReadLine();

However, it resulted in a memory leak. To get a sense of how it works, I tried:

    ScrapingBrowser Browser = new ScrapingBrowser();
    WebPage PageResult = Browser.NavigateToPage(new Uri(url));
    HtmlNode rawHTML = PageResult.Html;
    var imgNodes = rawHTML.SelectNodes("//img");

This also caused a memory leak. What am I missing in my implementation?

ORIGINAL QUESTION: I'm trying to get my application to grab specific images from a web site. So far I've been using HtmlAgilityPack, but it only retrieves the basic HTML. The missing elements show up when I use Inspect in Chrome, but Regex and HtmlAgilityPack can't seem to see them. The only other way I can describe them is that each tag carries a "data-v-??????" identifier. Here's an example:

    <div data-v-1a7a6550="" class="product-extra-images">
      <img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/34160_1Earring5th-5-18_1.jpg.100x100_q85_crop_upscale.jpg" width="50">
      <img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/34160_2Earring5th-5-18_1.jpg.100x100_q85_crop_upscale.jpg" width="50">
    </div>

Please let me know if you need additional details. Here's a sample of my latest code that couldn't extract the elements (in case it helps):

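    // Collected strings; the declaration was not part of the original sample.
    var imagesouces = new List<string>();

    var htmlDoc = new HtmlAgilityPack.HtmlDocument()
    {
        OptionFixNestedTags = true,
        OptionAutoCloseOnEnd = true,
        OptionReadEncoding = false
    };
    // The page HTML is loaded into htmlDoc at this point (loading code not shown).

    var imgNodes = htmlDoc.DocumentNode.SelectNodes("//div");

    // SelectNodes returns null when nothing matches, so guard before iterating.
    if (imgNodes != null)
    {
        foreach (var imgNode in imgNodes)
        {
            // Decode the text content of each <div>.
            var img = HttpUtility.HtmlDecode(imgNode.InnerText).Trim();

            imagesouces.Add(img);
        }
    }

    File.WriteAllLines(@"C:\Users\user\Desktop\WriteText.txt", imagesouces);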
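For reference, here is a minimal sketch of what I am ultimately after, assuming the rendered markup (like the example div above, e.g. copied from Chrome's Elements panel) is already available as a string. It does not solve the problem of getting that markup in the first place. The class name `ExtraImages` and the method `GetSources` are made up for illustration; the HtmlAgilityPack calls (`LoadHtml`, `SelectNodes`, `GetAttributeValue`) are the library's own.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ExtraImages
{
    // Returns the src of every <img> inside <div class="product-extra-images">.
    // Assumes renderedHtml already contains the script-generated elements.
    public static List<string> GetSources(string renderedHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(renderedHtml);

        var sources = new List<string>();
        var nodes = doc.DocumentNode.SelectNodes(
            "//div[contains(@class,'product-extra-images')]//img[@src]");
        if (nodes == null)
        {
            return sources; // SelectNodes returns null when nothing matches
        }

        foreach (var img in nodes)
        {
            sources.Add(img.GetAttributeValue("src", string.Empty));
        }
        return sources;
    }
}
```

Unlike the sample above, this reads the `src` attribute rather than `InnerText`, since an `<img>` tag has no text content to extract.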
  • I tried the suggestion by @HereticMonkey first since that was the easiest to implement. It still couldn't grab the data. I'll try the one provided by John next. Thank you both! – Xero Phane Oct 24 '18 at 01:57
  • Updated the question with additional details. Further developments are noted and resolved here (in case anyone later needs to resolve a similar issue): https://stackoverflow.com/questions/52971088/c-sharp-scrape-correct-web-content-following-jquery/52972739?noredirect=1#comment92852730_52972739 – Xero Phane Oct 24 '18 at 16:04

0 Answers