
I'm trying to scrape everything inside the `html` tag.

Basically, it gets to the GoToUrl line and opens the page in the browser, but then it doesn't go any further in the code.

It just times out after 60 seconds.

Here's the error:

fail: Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware[1]
      An unhandled exception has occurred while executing the request.

Update: edited for privacy reasons.

miatochai
  • The application is either timing out or hitting an execution error. If it is failing at exactly one minute, then it is probably failing due to a timeout. Use a stopwatch and see how long the app runs. You may need to use KeepAlive to prevent the connection from closing. See: https://learn.microsoft.com/en-us/dotnet/api/system.net.httpwebrequest.keepalive?view=net-6.0 – jdweng Aug 10 '22 at 13:58
  • You see localhost there because that's where the dev/wire protocol of the browser lives. This is a command timeout between the driver and the browser (there's a localhost server in between that passes commands to/from the driver and the browser). The GoToUrl command not only loads the URL, it also waits until the page is ready, and that part is what's timing out. You should be able to set that timeout via driver.Manage() (a short sketch follows these comments): https://stackoverflow.com/questions/10606703/selenium-webdriver-how-to-set-page-load-timeout-using-c-sharp – pcalkins Aug 10 '22 at 17:56
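
For reference, a minimal sketch of the page-load timeout pcalkins mentions; Selenium's .NET bindings expose it through driver.Manage().Timeouts() (the 120-second value and the URL here are placeholder assumptions):

// Raise the page-load timeout before navigating, so GoToUrl
// waits longer than the default before throwing.
driver.Manage().Timeouts().PageLoad = TimeSpan.FromSeconds(120);
driver.Navigate().GoToUrl("https://example.com");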

1 Answer


I made an example for your scenario.

Let's say we want to scrape the posts on the page, so we need a model to store our data:

using System.Text.Json;

public class Post
{
    public string ImageSrc { get; set; }
    public string Category { get; set; }
    public string Title { get; set; }
    public string Description { get; set; }
    public string Date { get; set; }

    public override string ToString()
    {
        return JsonSerializer.Serialize(this, 
              new JsonSerializerOptions { WriteIndented = true });
    }
}

Next, we need to initialize the Selenium WebDriver:

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

var options = new ChromeOptions();
options.AddArgument("--no-sandbox");
using var driver = new ChromeDriver(options);

// Here we set up a fluent wait
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(20))
{
    PollingInterval = TimeSpan.FromMilliseconds(250)
};
wait.IgnoreExceptionTypes(typeof(NoSuchElementException), typeof(StaleElementReferenceException));

// Navigate to the target url
driver.Navigate().GoToUrl("https://www.rtlnieuws.nl/zoeken?q=Philips+fraude");

// Accept cookies
// (lambda parameter named d so it doesn't shadow the outer 'driver' local)
var cookieBtn = wait.Until(d => d.FindElement(By.Id("onetrust-accept-btn-handler")));
cookieBtn.Click();

// Scroll to end
int count = 0; 
await driver.ScrollToEndAsync(d =>
{
    // Determine when we are at the end of the page
    var tempCount = d.FindElements(By.XPath("//a[@class = 'search-item search-item--artikel']")).Count;
    if (tempCount != count)
    {
        count = tempCount;
        return false;
    }       
    
    return true;
});

// List of post elements
var elements = wait.Until(d =>
{
    return d.FindElements(By.XPath("//div[@class = 'search-items']//a[contains(@class, 'search-item')]"));
});

// Print Posts in json format 
foreach (var e in elements)
{
    var post = new Post
    {
        ImageSrc = e.FindElement(By.XPath(".//img")).GetAttribute("src"),
        Category = e.FindElement(By.XPath(".//div/span")).Text,
        Title = e.FindElement(By.XPath(".//div/h2")).Text,
        Description = e.FindElement(By.XPath(".//div[@class = 'search-item__content']/p[@class = 'search-item__description']")).Text,
        Date = e.FindElement(By.XPath(".//div[@class = 'search-item__content']//span[@class = 'search-item__date']")).Text,
    };
    Console.WriteLine(post);
}

// Just for this sample, to keep the console open so we can see the results
Console.ReadLine();

In order to use ScrollToEndAsync as above, you must create an extension method:

using OpenQA.Selenium;

public static class WebDriverExtensions
{
    public static async Task ScrollToEndAsync(this IWebDriver driver, Func<IWebDriver, bool> pageEnd)
    {
        while (!pageEnd.Invoke(driver))
        {
            var js = (IJavaScriptExecutor)driver;
            js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");
            
            // Arbitrary delay between scrolling
            await Task.Delay(200);
        }
    }
}
ggeorge
  • Hey, thank you! I already have a model that I query with XPaths – but I do it with HtmlAgilityPack, so my plan was just to get all the HTML with Selenium and pass it to the existing HAP scraper. I think I might use Selenium for everything, because HAP can't really scroll or do anything with dynamic pages. – miatochai Aug 11 '22 at 07:50
  • I suppose you can use HAP together with Selenium: first load the whole page (via scrolling, as in this example) and then get the page source with `var src = driver.PageSource;`. You will then have the complete HTML page to load into HAP (see the sketch below). – ggeorge Aug 11 '22 at 07:55
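
A minimal sketch of that hand-off, assuming the HtmlAgilityPack NuGet package and the same search-item markup used in the answer above:

using HtmlAgilityPack;

// Grab the fully rendered page from Selenium after scrolling...
var src = driver.PageSource;

// ...and load it into HtmlAgilityPack for offline XPath queries.
var doc = new HtmlDocument();
doc.LoadHtml(src);

// Hypothetical query; reuse whatever XPaths your existing HAP scraper has.
var items = doc.DocumentNode.SelectNodes("//a[contains(@class, 'search-item')]");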