Given a certain website containing many resources, I need to automate the process of getting all the resources' URLs. To complicate matters, these URLs are not contained in the initially loaded markup but are instead inserted into the DOM via JavaScript, based on user interaction with the page.

Therefore I must retrieve results from the Network tab of Chrome's DevTools. But I'm having difficulty getting started.

Here's my first attempt:

Imports System.Text
Imports OpenQA.Selenium
Imports OpenQA.Selenium.Chrome
Imports OpenQA.Selenium.Support.UI

Friend Module Main
  Public Sub Main()
    Dim oBuilder As StringBuilder
    Dim oOptions As ChromeOptions
    Dim oDriver As IWebDriver
    Dim oWait As WebDriverWait
    Dim sType As String

    sType = LogType.Browser

    oBuilder = New StringBuilder

    oOptions = New ChromeOptions
    oOptions.SetLoggingPreference(sType, LogLevel.All)

    oDriver = New ChromeDriver(oOptions)
    oDriver.Navigate.GoToUrl("http://example.com")

    oWait = New WebDriverWait(oDriver, TimeSpan.FromSeconds(15))
    oWait.Until(Function(Driver) Driver.FindElement(By.TagName("a")))

    oDriver.Manage.Logs.GetLog(sType).ToList.ForEach(Sub(Log)
                                                       oBuilder.AppendLine($"Level:   {Log.Level}")
                                                       oBuilder.AppendLine($"Message: {Log.Message}")
                                                     End Sub)

    Console.WriteLine(oBuilder.ToString)
  End Sub
End Module

Upon the first run of this code, the StringBuilder contained only one LogEntry:

Timestamp            Level  Message
---------            -----  -------
2/25/2019 5:05:05 PM Severe http://example.com/favicon.ico - Failed to load resource: the server responded with a status of 404 (Not Found)

Since that first run, however, no logs have been retrieved at all. Moreover, this is not the log I need: I need the resource URLs.

There are three main problems to overcome here:

  1. When a page is retrieved from the browser's local cache, it appears there is no output to the log
  2. There doesn't appear to be a way to set the LogLevel, even though my code attempts to do so early on
  3. These logs are not resource URLs

How can I get the URLs from the DevTools Network tab? I found this quick sample—in fact it inspired my code above—but it's using the Java SDK. The two APIs seem slightly different.

InteXX

2 Answers

According to "How to set Chrome preferences using Selenium Webdriver .NET binding?", you will need your own class, such as:

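using System.Collections.Generic;
using OpenQA.Selenium.Chrome;

// ChromeOptions subclass that exposes a "prefs" dictionary, as suggested in the linked question.
public class ChromeOptionsWithPrefs : ChromeOptions
{
    public Dictionary<string, object> prefs { get; set; }
}

// Assumes _driver is a static ChromeDriver field declared in the enclosing class.
public static void Initialize()
{
    var options = new ChromeOptionsWithPrefs();
    options.prefs = new Dictionary<string, object>
    {
        { "enableNetwork", true },
        { "traceCategories", "netlog,devtools.timeline,devtools" }
    };
    _driver = new ChromeDriver(@"C:\path\chromedriver", options);
}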

I cannot verify this code, but in Java you need to enable network logging in just this way. The available trace categories are listed at the pseudo-URL chrome://tracing/ in Chrome.
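
For comparison, here is a minimal Python sketch of the capabilities that, as far as I know, recent ChromeDriver versions read for this purpose. It only builds the dictionary and does not start a browser. The helper name build_chrome_capabilities is hypothetical, and the key "goog:loggingPrefs" is an assumption that holds for newer drivers (older ones used the unprefixed "loggingPrefs"); "enableNetwork" and "traceCategories" are the same settings as above, placed under "perfLoggingPrefs".

```python
import json


def build_chrome_capabilities(trace_categories=None):
    """Return a capabilities dict that requests ChromeDriver's performance log.

    Hypothetical helper: it only assembles the dictionary, so it can be
    inspected without a browser or driver being installed.
    """
    # Settings for the performance log itself (network events switched on).
    perf_prefs = {"enableNetwork": True}
    if trace_categories:
        perf_prefs["traceCategories"] = trace_categories

    return {
        "browserName": "chrome",
        # Assumption: newer drivers expect the vendor-prefixed key.
        "goog:loggingPrefs": {"performance": "ALL"},
        "goog:chromeOptions": {"perfLoggingPrefs": perf_prefs},
    }


if __name__ == "__main__":
    caps = build_chrome_capabilities("netlog,devtools.timeline,devtools")
    print(json.dumps(caps, indent=2))
```

Whether the .NET binding passes these keys through unchanged is not something I have checked.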

Jens Dibbern
  • That compiles and runs, thank you, but it still produces no more output than indicated above. – InteXX Mar 01 '19 at 02:53
  • I switched the `LogType` to `Driver`, and this time I got 755 lines of JSON-formatted log data. Progress. Unfortunately, however, the only URL information is for that of the main request, i.e. `http://example.com`. Resource URLs aren't reported. – InteXX Mar 01 '19 at 07:56
  • I did some digging in my archive and added the trace categories. – Jens Dibbern Mar 01 '19 at 17:13
  • I appreciate the additional info, but it's still not giving up much. I'm navigating to the standard IIS start page at `http://localhost`, hoping to find `http://localhost/iisstart.png` somewhere in the logs. I can think of two possibilities: 1) The cache needs to be cleared and disabled prior to navigation; 2) An additional command is needed to turn on request monitoring. I'm new to Selenium, so I'm unsure about these (or how to implement them). Are they worth investigation in your opinion? – InteXX Mar 01 '19 at 21:03
  • I tried the Java version of this with a random YouTube video and got the resource downloads in the logs. – Jens Dibbern Mar 01 '19 at 21:11
  • Hm... that's sobering news. Java isn't an option for me on this one. I may have to investigate other approaches to this problem. – InteXX Mar 01 '19 at 21:24
  • FYI I tried using `JavaScriptExecutor.ExecuteScript("return document.body.outerHTML")` in an attempt to discover all resource URLs, and I got lucky. It worked. The information I'm looking for is contained in that markup. So it turns out there's no need to access Chrome DevTools at all—at least for this page, which is all that's necessary for this project. – InteXX Mar 02 '19 at 23:18
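
The approach in the last comment (read the rendered markup, then pull the URLs out of it) can be sketched in Python with only the standard library. This is an illustration under assumptions: the markup string stands in for whatever "return document.body.outerHTML" returns, the names ResourceUrlCollector and collect_resource_urls are hypothetical, and only src and href attributes are considered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceUrlCollector(HTMLParser):
    """Collects src/href attribute values, resolved against a base URL."""

    URL_ATTRS = {"src", "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.URL_ATTRS and value:
                # Resolve relative references such as "iisstart.png".
                url = urljoin(self.base_url, value)
                if url not in self.urls:
                    self.urls.append(url)


def collect_resource_urls(markup, base_url):
    """Return the de-duplicated resource URLs found in the markup, in order."""
    collector = ResourceUrlCollector(base_url)
    collector.feed(markup)
    collector.close()
    return collector.urls


if __name__ == "__main__":
    # Stand-in for the string returned by the browser.
    sample = '<body><img src="iisstart.png"><a href="http://example.com/page">x</a></body>'
    for url in collect_resource_urls(sample, "http://localhost/"):
        print(url)
```

This only finds URLs that appear as attributes in the DOM at the moment the markup is read. Resources fetched by script without leaving a trace in the markup would still need the network log.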

You can get all resources with Selenium using browser logs.

def get_logs(self):
    logs = self.browser.get_log('performance')
    return logs
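
The entries returned by that call still have to be filtered to get URLs. The sketch below assumes the usual ChromeDriver layout, where each entry's "message" field is a JSON string wrapping a DevTools event, and keeps the request URL from every "Network.requestWillBeSent" event. The function name extract_request_urls and the sample entries are hypothetical and made up for illustration; they are not real log output.

```python
import json


def extract_request_urls(log_entries):
    """Return the distinct request URLs found in performance-log entries."""
    urls = []
    for entry in log_entries:
        try:
            # Each entry wraps the DevTools event as a JSON string.
            event = json.loads(entry["message"])["message"]
        except (KeyError, TypeError, ValueError):
            continue  # Skip anything that does not have the expected shape.
        if not isinstance(event, dict):
            continue
        if event.get("method") != "Network.requestWillBeSent":
            continue
        url = event.get("params", {}).get("request", {}).get("url")
        if url and url not in urls:
            urls.append(url)
    return urls


if __name__ == "__main__":
    # Hypothetical entries shaped like the output of get_log('performance').
    sample_entries = [
        {"level": "INFO", "message": json.dumps({"message": {
            "method": "Network.requestWillBeSent",
            "params": {"request": {"url": "http://localhost/"}}}})},
        {"level": "INFO", "message": json.dumps({"message": {
            "method": "Network.responseReceived",
            "params": {"response": {"url": "http://localhost/"}}}})},
        {"level": "INFO", "message": json.dumps({"message": {
            "method": "Network.requestWillBeSent",
            "params": {"request": {"url": "http://localhost/iisstart.png"}}}})},
    ]
    for url in extract_request_urls(sample_entries):
        print(url)
```

With a live session, the list passed in would be the return value of get_logs() above, provided the performance log was enabled when the driver was created.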
Ger Mc
  • That looks to be the Python/Java implementation. The .NET implementation has a different API. In any case, I'm not sure this call will work in .NET; the Chrome driver doesn't provide the `performance` log. More [info](https://github.com/adamdriscoll/selenium-powershell/issues/12#issuecomment-467650342). – InteXX Mar 04 '19 at 20:23
  • Ah yes, my bad. Maybe look here: https://stackoverflow.com/questions/50986959/how-to-set-up-performance-logging-in-seleniumwebdriver-with-chrome – Ger Mc Mar 05 '19 at 13:53