
At work, we log into a provider's website that serves as a repository of files. A list of the files appears. Each filename is a link. Click the link, and download the file. It's a very lightweight website.

I'm trying to log in and download the files without the tedious task of clicking each one (there's no "select all" checkbox). I'm using the WebBrowser control on a form with a Go button to begin. Here's the code. Please skip down to the row of asterisks.

Private Sub btnGo_Click(sender As Object, e As EventArgs) Handles btnGo.Click
    Try
        PageLoaded = False
        browser.Navigate("https://[the website]/Account/Login.htm", False)
        While Not PageLoaded
            Application.DoEvents()
        End While
    Catch ex As Exception
        MsgBox(ex.Message)
    End Try
    Try
        browser.Document.GetElementById("username").InnerText = [username]
        browser.Document.GetElementById("password").InnerText = [password]
        PageLoaded = False
        browser.Document.Forms("mainform").InvokeMember("submit")
        While Not PageLoaded
            Application.DoEvents()
        End While
    Catch ex As Exception
        MsgBox(ex.Message)
    End Try

    ' ************************************
    Dim mycookies As String
    mycookies = browser.Document.Cookie
    ' DEBUG: verified cookies are indeed present

    Try
        Dim cookieJar As New CookieContainer
        Dim cookies As String() = browser.Document.Cookie.Split({"; "}, StringSplitOptions.RemoveEmptyEntries)
        Dim cookievaluepairs() = cookies(0).Split("=")
        Dim cky As New Cookie(cookievaluepairs(0), cookievaluepairs(1))
        cky.Domain = browser.Document.Domain
        cookieJar.Add(cky)
        Dim cookievaluepairs1() = cookies(1).Split("=")
        ' Use the second cookie's name/value pair (cookievaluepairs1), not the first one again
        Dim cky1 As New Cookie(cookievaluepairs1(0), cookievaluepairs1(1))
        cky1.Domain = browser.Document.Domain
        cookieJar.Add(cky1)
        ' DEBUG: verified cookieJar contains expected cookies

        Dim wwwclient As New CookieAwareWebClient(cookieJar)
        ' DEBUG: please see class code below

        Dim x As Integer
        Dim dlurl As String = ""
        Dim inputs As HtmlElementCollection = browser.Document.Links
        For Each elm As HtmlElement In inputs
            If Microsoft.VisualBasic.Left(elm.OuterHtml, 10) = "<A href=""/" Then
                dlurl = elm.GetAttribute("href")
                ' DEBUG: crappily named dlurl indeed has correct URI

                wwwclient.DownloadFile(dlurl, "D:\Desktop\file" & x)
                x += 1 ' advance the counter so each download gets its own file name
                ' DEBUG: overriden function GetWebRequest fires
                '        please see class code below

            End If
        Next
    Catch ex As Exception
        MsgBox(ex.Message)
        ' DEBUG: always lands here with 401 error

    End Try
End Sub

Here's one of the many versions of CookieAwareWebClient found here on SO.

Public Class CookieAwareWebClient
    Inherits WebClient

    Private m_container As CookieContainer = New CookieContainer()

    Public Sub New(cc As CookieContainer)
        m_container = cc
        ' DEBUG: verified m_container now has cookieJar passed as cc
    End Sub


    Protected Overrides Function GetWebRequest(ByVal address As Uri) As WebRequest
        Dim request As WebRequest = MyBase.GetWebRequest(address)
        Dim webRequest As HttpWebRequest = TryCast(request, HttpWebRequest)

        If webRequest IsNot Nothing Then
            webRequest.CookieContainer = m_container
        End If

        ' Return the original request so non-HTTP requests are not replaced by Nothing
        Return request
        ' DEBUG: verified webRequest.CookieContainer is correct
    End Function
End Class

I single-step through the code all the way to the wwwclient.DownloadFile statement, then through the code in the GetWebRequest function, and after a pause, I get a 401 Unauthorized error. This has happened with the five or six variations of CookieAwareWebClient I've found.

The two cookies I retrieve from the WebBrowser control after the code successfully logs itself in look like this (the token is different every time, obviously).

"samlssologgedout=SSO%20Logged%20Out" "token=A4AA416E-46C8-11e9-92CD-005056A005E4"

I've verified that those are the same cookies that go into 'webRequest.CookieContainer'. As well, in the WebBrowser control, after log in, you can click on the file's link to download it.
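
For reference, the cookie-copying step can be written as a loop rather than by fixed index. The following is only a minimal sketch: BuildCookieJar and the example.com domain are hypothetical names used for illustration, and the cookie string is the sample shown above. Note that Document.Cookie only exposes cookies that script can read, so anything the server flags HttpOnly would not be copied this way.

```vbnet
Imports System
Imports System.Net

Module CookieCopySketch
    ' Hypothetical helper: builds a CookieContainer from a "name=value; name=value"
    ' string such as the one returned by browser.Document.Cookie.
    Function BuildCookieJar(cookieHeader As String, domain As String) As CookieContainer
        Dim jar As New CookieContainer()
        For Each pair As String In cookieHeader.Split({"; "}, StringSplitOptions.RemoveEmptyEntries)
            ' Split on the first "=" only, so values that contain "=" stay intact.
            Dim idx As Integer = pair.IndexOf("="c)
            If idx <= 0 Then Continue For
            Dim cky As New Cookie(pair.Substring(0, idx), pair.Substring(idx + 1), "/", domain)
            jar.Add(cky)
        Next
        Return jar
    End Function

    Sub Main()
        ' example.com is a placeholder domain, not the real site.
        Dim sample As String = "samlssologgedout=SSO%20Logged%20Out; token=A4AA416E-46C8-11e9-92CD-005056A005E4"
        Dim jar As CookieContainer = BuildCookieJar(sample, "example.com")
        Console.WriteLine(jar.Count)
    End Sub
End Module
```

In the button handler this would be called as BuildCookieJar(browser.Document.Cookie, browser.Document.Domain), and the result passed to the CookieAwareWebClient constructor.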

Does anybody see anything obviously wrong in the code?

Still googling while writing the question, I just came across Notes to Inheritors in the MS documentation for WebClient -- "Derived classes should call the base class implementation of WebClient to ensure the derived class works as expected."

That sounds like something you would do in the constructor? Or is this taken care of in the statement MyBase.GetWebRequest(address)?

RobertSF

1 Answer

After much hacking and googling, I'm going to conclude it's a myth that you can make WebClient "cookie aware." I never could make it work, and almost all the threads about it that I read ended without a solution. And anyway, WebClient is apparently deprecated.

To recap, the mission was to automate the login and download of files from a low-security website that uses forms authentication. The WebBrowser control would have worked fine, except that it uses IE, and IE refuses to download PDF files silently. It insists on prompting whether to open, save, or discard.

I started playing around with HttpWebRequest, HttpRequest, WebRequest, HttpClient, and a bunch of variations, and got nowhere. Then it occurred to me to look for a Chrome-based WebBrowser control, and I stumbled across Selenium. That proved to be the solution for me.

Selenium's principal use appears to be testing software, but it also lets you manipulate web pages. You can easily install it within Visual Studio through NuGet. You also need to install a browser-specific driver. There are drivers for every major browser, but using the IE driver would be pointless because I would still have the problem of being prompted at every file. I instead downloaded the Chrome and Firefox drivers. Users here can choose between the two browsers, and the split is about 50/50.

Here's how simple the code was in the end.

Dim Options = New FirefoxOptions
Options.SetPreference("browser.download.folderList", 2)
'Options.SetPreference("browser.download.dir", "C:\\Windows\\temp")
Options.SetPreference("browser.download.useDownloadDir", True)
Options.SetPreference("browser.helperApps.neverAsk.saveToDisk", "application/octet-stream")
Options.SetPreference("pdfjs.disabled", True)
Dim driverService = FirefoxDriverService.CreateDefaultService()
driverService.HideCommandPromptWindow = True
Dim browser = New FirefoxDriver(driverService, Options)
' Navigate().GoToUrl loads the page; a bare Navigate() call does nothing on its own
browser.Navigate().GoToUrl("https://[the website]")
Dim elm = browser.FindElementById("username")
elm.SendKeys([the username])
elm = browser.FindElementById("password")
elm.SendKeys([the password])
elm = browser.FindElementById("loginSubmit")
elm.Click()
While InStr(browser.Url, "token") = 0
    Application.DoEvents()
End While
Dim links As IList(Of IWebElement) = browser.FindElementsByPartialLinkText(".")
For Each link As IWebElement In links
    link.Click()
Next
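
The While loop that polls browser.Url could also be written with Selenium's explicit wait, which gives up after a timeout instead of spinning forever. This is a sketch only: it assumes the Selenium.Support NuGet package (which provides WebDriverWait) is installed, and WaitForUrlFragment and the 30-second timeout are hypothetical choices, not part of the code above.

```vbnet
Imports System
Imports OpenQA.Selenium
Imports OpenQA.Selenium.Support.UI

Module WaitSketch
    ' Hypothetical helper: blocks until the browser's URL contains the given
    ' fragment, or throws WebDriverTimeoutException once timeoutSeconds elapse.
    Sub WaitForUrlFragment(driver As IWebDriver, fragment As String, timeoutSeconds As Integer)
        Dim wait As New WebDriverWait(driver, TimeSpan.FromSeconds(timeoutSeconds))
        wait.Until(Of Boolean)(Function(d) d.Url.Contains(fragment))
    End Sub
End Module
```

It would be called right after elm.Click(), for example as WaitForUrlFragment(browser, "token", 30).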

I ran into a problem with the neverAsk.saveToDisk part. It just wasn't working. It turned out that I had the wrong mime type. I got the solution to that from this comment - Set Firefox profile to download files automatically using Selenium and Java
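
For the Chrome driver, the download settings go through profile preferences rather than SetPreference. A rough equivalent might look like the following. Treat it as a sketch: CreateChromeDriver and its downloadDir parameter are hypothetical, and the three preference keys are the commonly used Chrome ones rather than anything taken from the Firefox code above.

```vbnet
Imports OpenQA.Selenium.Chrome

Module ChromeDownloadSketch
    ' Hypothetical helper: starts Chrome configured to save downloads
    ' (including PDFs) into downloadDir without prompting.
    Function CreateChromeDriver(downloadDir As String) As ChromeDriver
        Dim options As New ChromeOptions()
        options.AddUserProfilePreference("download.default_directory", downloadDir)
        options.AddUserProfilePreference("download.prompt_for_download", False)
        ' Download PDFs instead of opening them in Chrome's built-in viewer.
        options.AddUserProfilePreference("plugins.always_open_pdf_externally", True)

        Dim service = ChromeDriverService.CreateDefaultService()
        service.HideCommandPromptWindow = True
        Return New ChromeDriver(service, options)
    End Function
End Module
```

The rest of the script (finding the login fields, clicking the links) should work unchanged against the returned driver.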

RobertSF