UPDATE #2: Continuing this effort (see the original question and update #1 below). ScrapySharp looked promising, but no matter what I tried, the process consumed all available memory and never produced any output. I also found that, in a test WebBrowser control, the correct web page does not load, apparently because of jQuery. The site seems to have a function that checks how you arrived at the requested page, validates something about your browser, and redirects you to a generic sign-up page.
Any thoughts on how to appease the gatekeeper?
UPDATE #1 (details of the original, not-unique question are below): First of all, THANK YOU for the suggestions!
I tried ScrapySharp, as @John suggested, like this:
ScrapingBrowser Browser = new ScrapingBrowser();
WebPage PageResult = Browser.NavigateToPage(new Uri("http://www.example-site.com"));
HtmlNode rawHTML = PageResult.Html;
Console.WriteLine(rawHTML.InnerHtml);
Console.ReadLine();
However, memory usage kept growing until the process ran out of memory, which looks like a memory leak. To get a sense of how it works, I tried:
ScrapingBrowser Browser = new ScrapingBrowser();
WebPage PageResult = Browser.NavigateToPage(new Uri(url));
HtmlNode rawHTML = PageResult.Html;
var imgNodes = rawHTML.SelectNodes("//img");
This also leaked memory. What am I missing in my implementation?
ORIGINAL QUESTION: I'm trying to get my application to grab specific images from a web site. So far I've been using HtmlAgilityPack, but it only retrieves the basic HTML. The best I can describe the missing elements is that they show up when I use Inspect in Chrome (but neither Regex nor HtmlAgilityPack seems able to see them), and each tag carries a "data-v-??????" attribute. Here's an example:
<div data-v-1a7a6550="" class="product-extra-images"><img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/34160_1Earring5th-5-18_1.jpg.100x100_q85_crop_upscale.jpg" width="50"><img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/34160_2Earring5th-5-18_1.jpg.100x100_q85_crop_upscale.jpg" width="50"></div>
Please let me know if you need additional details. Here's a sample of my latest code that couldn't extract the elements (in case it helps):
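// Holds the text extracted from each matched node (declaration added so the sample compiles).
var imagesouces = new List<string>();
var htmlDoc = new HtmlAgilityPack.HtmlDocument()
{
    OptionFixNestedTags = true,
    OptionAutoCloseOnEnd = true,
    OptionReadEncoding = false
};
// NOTE: the code that loads the downloaded page into htmlDoc is not shown in this sample.
var imgNodes = htmlDoc.DocumentNode.SelectNodes("//div");
// SelectNodes returns null (not an empty collection) when nothing matches.
if (imgNodes != null)
{
    foreach (var imgNode in imgNodes)
    {
        // Decode HTML entities in the node text
        var img = HttpUtility.HtmlDecode(imgNode.InnerText).Trim();
        imagesouces.Add(img);
    }
}
File.WriteAllLines(@"C:\Users\user\Desktop\WriteText.txt", imagesouces);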
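To make the goal concrete, here is a minimal sketch of the extraction I am after. It assumes the rendered markup from the example above is already available as a string, which is exactly the part I cannot get yet. The class name ExtraImageExtractor, the method GetExtraImageUrls, and the example.com URLs are hypothetical names made up for illustration; the sketch relies on the HtmlAgilityPack NuGet package.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class ExtraImageExtractor // hypothetical helper, for illustration only
{
    // Returns the src of every <img> that is a direct child of a div whose
    // class contains "product-extra-images". The input is assumed to be the
    // rendered markup (what Chrome's Inspect shows), not the raw page source.
    public static List<string> GetExtraImageUrls(string renderedHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(renderedHtml);

        var urls = new List<string>();
        var nodes = doc.DocumentNode.SelectNodes(
            "//div[contains(@class, 'product-extra-images')]/img[@src]");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (nodes == null)
        {
            return urls;
        }

        foreach (var node in nodes)
        {
            var src = node.GetAttributeValue("src", string.Empty).Trim();
            if (src.Length > 0)
            {
                urls.Add(src);
            }
        }
        return urls;
    }

    public static void Main()
    {
        // Shortened stand-in for the markup in the example; the URLs are placeholders.
        const string sample =
            "<div data-v-1a7a6550=\"\" class=\"product-extra-images\">" +
            "<img data-v-1a7a6550=\"\" src=\"https://example.com/a.jpg\" width=\"50\">" +
            "<img data-v-1a7a6550=\"\" src=\"https://example.com/b.jpg\" width=\"50\">" +
            "</div>";

        foreach (var url in GetExtraImageUrls(sample))
        {
            Console.WriteLine(url);
        }
    }
}
```

The sketch reads the src attribute instead of InnerText, since an img element has no inner text. It does nothing about the missing elements themselves: if the markup is never present in the downloaded HTML, the XPath query simply matches nothing.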