
I'm trying to download files from http://www.oracle.com/technetwork/server-storage/developerstudio/downloads/index.html in a headless context. I have an account (they are free), but the site doesn't make it easy: it apparently uses a chain of JavaScript forms and redirections. With Firefox I can use the element inspector to copy the file's URL as a cURL command once the download starts, and then run that command on a headless machine to download the file. So far, though, all my attempts to get the file using only the headless machine have failed.

I've managed to log in with:

#!/usr/bin/env python3

username="<my username>"
password="<my password>"

import requests
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
caps = DesiredCapabilities.PHANTOMJS
caps["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0"
driver = webdriver.PhantomJS("/usr/local/bin/phantomjs")
driver.set_window_size(1120, 550)
driver.get("http://www.oracle.com/technetwork/server-storage/developerstudio/downloads/index.html")
print("loaded")
driver.find_element_by_name("agreement").click()
print("clicked agreement")
driver.find_element_by_partial_link_text("RPM installer").click()
print("clicked link")
driver.find_element_by_id("sso_username").send_keys(username)
driver.find_element_by_id("ssopassword").send_keys(password)
driver.find_element_by_xpath("//input[contains(@title,'Please click here to sign in')]").click()
print("submitted")

print(driver.get_cookies())

print(driver.current_url)
print(driver.page_source)
driver.quit()

I suspect the login worked, because the cookies contain some data associated with my username. In Firefox, however, submitting the form starts a download after 3-4 redirections, while here nothing happens, and page_source and current_url still belong to the login page.

Maybe the site is actively blocking this kind of use, or maybe I'm doing something wrong. Any idea how to actually download the file?

Jellby
  • See this issue. https://bugs.chromium.org/p/chromium/issues/detail?id=696481. I think the feature is not yet available in chromedriver – Tarun Lalwani Sep 14 '17 at 15:24
  • @TarunLalwani Does selenium + phantomjs use chromium under the hood? – Jellby Sep 14 '17 at 15:41
  • No, but phantomjs is also no longer maintained, so use it very carefully. If it works, good; if not, think of something else – Tarun Lalwani Sep 14 '17 at 15:44
  • @TarunLalwani I'm not tied to phantomjs, I actually had never used it before this, so any suggestions for alternatives are welcome. But I gather from your comment that you don't have a solution either ;) – Jellby Sep 14 '17 at 16:01
  • As you are using `Firefox element inspector` why don't you try `Headless Firefox` browser? – undetected Selenium Sep 14 '17 at 16:40
  • @DebanjanB Because I'd rather not install Firefox and all its dependencies in the headless machine, unless finally needed. – Jellby Sep 14 '17 at 17:04
  • I had the same issue earlier. Since you can get the download url, you can download the file into the browser and save it as a blob. Then you can extract the contents of the blob from selenium's execute script. See my answer for a similar question: https://stackoverflow.com/a/46030975/4110233 – TheChetan Sep 14 '17 at 17:27
  • @TheChetan Hm... I'm not sure I can get the download url. With a full-fledged Firefox I can get the cURL line from the element inspector, but that includes cookies and a telling "Auth" address parameter. I haven't been able to get this with selenium. – Jellby Sep 14 '17 at 18:09
  • @Jellby just try it. That's the best part of this: you don't need to know all the headers and cookies, the browser autofills them for you. – TheChetan Sep 14 '17 at 18:11
  • @TheChetan I did try. Both with the file URL (`http://download.oracle.com/otn/solaris/studio/12x/OracleDeveloperStudio12.6-linux-x86-rpm.tar.bz2`) and with `driver.current_url`. I got an empty file in both cases. But maybe I just have to increase the time delay (I had 300 seconds), as I can see with `atop` that there is network activity and the file is quite large. – Jellby Sep 14 '17 at 18:16
  • @Jellby Don't add a delay; add an event listener instead, so that on completion of the download you can trigger the script to extract the blob. Alternatively, if you know that under normal circumstances it will finish within some time (say 5 mins), you can try the delay. – TheChetan Sep 14 '17 at 18:18
  • @TheChetan I would if I knew how. Anyway, 600 seconds didn't help and net activity stops after a while, so if something is being downloaded I guess it's done. Still, I get `downloaded_file == None`, apparently. – Jellby Sep 14 '17 at 18:29
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/154478/discussion-between-thechetan-and-jellby). – TheChetan Sep 14 '17 at 18:31

1 Answer

Thanks to TheChetan's comment I got it working. I didn't use the JavaScript blob route, though, but the requests approach suggested by Tarun Lalwani in https://stackoverflow.com/a/46027215. It took me a while to realize that I had to set the user agent in the request too. This is what finally works for me:

#!/usr/bin/env python3

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from requests import Session
from urllib.parse import urlparse
from os.path import basename
from hashlib import sha256
import sys

index_url = "http://www.oracle.com/technetwork/server-storage/developerstudio/downloads/index.html"
link_text = "RPM installer"
username="<my username>"
password="<my password>"
user_agent = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0"

# set up browser (pass the capabilities explicitly so the user agent is applied)
caps = DesiredCapabilities.PHANTOMJS.copy()
caps["phantomjs.page.settings.userAgent"] = user_agent
driver = webdriver.PhantomJS("/usr/local/bin/phantomjs", desired_capabilities=caps)
driver.set_window_size(800,600)

# load index page and click through
driver.get(index_url)
print("loaded")
driver.find_element_by_name("agreement").click()
print("clicked agreement")
link = driver.find_element_by_partial_link_text(link_text)
sha = driver.find_element_by_xpath("//*[contains(text(), '{0}')]/following::*[contains(text(), 'sum:')]/following-sibling::*".format(link_text)).text
file_url = link.get_attribute("href")
filename = basename(urlparse(file_url).path)
print("filename: {0}".format(filename))
print("checksum: {0}".format(sha))
link.click()
print("clicked link")
driver.find_element_by_id("sso_username").send_keys(username)
driver.find_element_by_id("ssopassword").send_keys(password)
driver.find_element_by_xpath("//input[contains(@title,'Please click here to sign in')]").click()
print("submitted")

# we should be logged in now

def progressBar(title, value, endvalue, bar_length=60):
  percent = float(value) / endvalue
  arrow = '-' * int(round(percent * bar_length)-1) + '>'
  spaces = ' ' * (bar_length - len(arrow))
  sys.stdout.write("\r{0}: [{1}] {2}%".format(title, arrow + spaces, int(round(percent * 100))))
  sys.stdout.flush()

# transfer the cookies to a new session and request the file
session = Session()
session.headers = {"user-agent": user_agent}
for cookie in driver.get_cookies():
  session.cookies.set(cookie["name"], cookie["value"])
driver.quit()
r = session.get(file_url, stream=True)
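# after following the redirects, r.url should be the final url, including the added parameter
new_url = r.url
r.close()  # the body of this first response is not needed
print("final url {0}".format(new_url))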
r = session.get(new_url, stream=True)
print("requested")
length = int(r.headers['Content-Length'])
title = "Downloading ({0})".format(length)
sha_file = sha256()
chunk_size = 2048
done = 0
with open(filename, "wb") as f:
  for chunk in r.iter_content(chunk_size):
    f.write(chunk)
    sha_file.update(chunk)
    done += len(chunk)
    progressBar(title, done, length)
print()

# check integrity (normalize the checksum scraped from the page before comparing)
if sha_file.hexdigest() == sha.strip().lower():
  print("checksums match")
  sys.exit(0)
else:
  print("checksums do NOT match!")
  sys.exit(1)

So in the end, the idea is to use selenium+phantomjs for logging in, and then to reuse the browser's cookies for a plain request.
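The cookie handoff can also be pulled out into a small helper. What follows is only a minimal sketch, not part of the script above: the function name `session_from_driver_cookies` is hypothetical, the example cookies are made up, and it assumes the cookie dicts have the shape returned by `driver.get_cookies()` (with `name` and `value`, and usually `domain` and `path`). Unlike the script above, it also carries over each cookie's domain and path, so a cookie is only sent to the hosts it was issued for.

```python
from requests import Session


def session_from_driver_cookies(driver_cookies, user_agent):
    """Build a requests Session that reuses cookies exported from a browser.

    driver_cookies is assumed to be a list of dicts such as the one
    returned by selenium's driver.get_cookies().
    """
    session = Session()
    # Send the same user agent the browser used when it logged in.
    session.headers.update({"User-Agent": user_agent})
    for cookie in driver_cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            # Fall back to "any domain" and the root path if the keys are missing.
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return session


# Made-up cookies standing in for driver.get_cookies().
example_cookies = [
    {"name": "session_id", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "prefs", "value": "en"},
]
session = session_from_driver_cookies(example_cookies, "Mozilla/5.0 (example)")
```

In the script above, this would be called as `session_from_driver_cookies(driver.get_cookies(), user_agent)` just before `driver.quit()`.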

Jellby