3

Here is my situation: I have to log in to a website and download a CSV from it, headless, from a Linux server. The page relies on JavaScript and does not work without it.

After some research I went with Selenium and PhantomJS. Logging in, setting the parameters for the CSV and finding the download button with Selenium/PhantomJS/Py3 was no problem, actually surprisingly enjoyable.

But clicking the download button did not do anything. After some research I found out that PhantomJS does not seem to support download-dialogs and downloads but that it is on the upcoming feature list.

So I thought I would use a workaround with urllib, after finding out that the download button just calls a REST API URL. The problem is that it only works if you are logged in to the site. The first attempt failed, returning b'{"success":false,"session":"expired"}', which makes sense, as I expect Selenium and urllib to use different sessions. So I thought I would use the headers from Selenium's driver in urllib, trying this:

...
url = 'http://www.foo.com/api/index'
data = urllib.parse.urlencode({
        'foopara': 'cadbrabar',
    }).encode('utf-8')
headers = {}
for cookie in driver.get_cookies():
    headers[cookie['name']] = cookie['value']
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
    page = response.read()
driver.close()

Unfortunately this yielded the same result of an expired session. Am I doing something wrong? Is there a way around this, are there other suggestions, or am I at a dead end? Thanks in advance.

rikaidekinai

4 Answers

4

I found a solution and wanted to share it. One requirement changed: I am no longer using PhantomJS but ChromeDriver, which runs headlessly with a virtual framebuffer. The result is the same and it gets the job done.


What you need is:

pip install selenium pyvirtualdisplay

apt-get install xvfb

Download ChromeDriver


I use Py3.5 and a test file from ovh.net that is linked with an <a> tag instead of a button. The script waits for the <a> element to be present on the page, then clicks it. If you don't wait for the element and the site loads content asynchronously, the element you try to click might not be there yet. The download location is a folder relative to the current working directory. The script checks that directory once per second to see whether the file has been downloaded. If I am not wrong, a file carries a temporary extension during the download (.crdownload in Chrome, .part in Firefox), and as soon as it becomes the .dat specified in filename the script finishes. If you close the virtual framebuffer and driver before that, the download will not complete. The complete script looks like this:

#!/usr/bin/env python3
# coding: utf-8

import os
import sys
import time
from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import glob


def main(argv):
    url = 'http://ovh.net/files'
    dl_dir = 'downloads'
    filename = '1Mio.dat'

    display = Display(visible=0, size=(800, 600))
    display.start()

    chrome_options = webdriver.ChromeOptions()
    dl_location = os.path.join(os.getcwd(), dl_dir)

    prefs = {"download.default_directory": dl_location}
    chrome_options.add_experimental_option("prefs", prefs)
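    chromedriver = "./chromedriver"
    # executable_path and chrome_options are the Selenium 3 keyword names;
    # Selenium 4 replaced them with service= and options=.
    driver = webdriver.Chrome(executable_path=chromedriver, chrome_options=chrome_options)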

    driver.set_window_size(800, 600)
    driver.get(url)
    # Wait up to 30 s for the link to appear; until() returns the located element.
    link_xpath = '//a[@href="' + filename + '"]'
    hyperlink = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.XPATH, link_xpath)))
    hyperlink.click()

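    # Poll once per second until the finished file shows up in the download directory.
    while not glob.glob(os.path.join(dl_location, filename)):
        time.sleep(1)

    # quit() ends the whole browser session, not just the current window.
    driver.quit()
    display.stop()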

if __name__ == '__main__':
    main(sys.argv)

I hope this helps someone in the future.
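
The polling loop in the script above never gives up if the download fails to start. A minimal sketch of a bounded wait follows, assuming in-progress downloads carry a temporary suffix such as .crdownload or .part. The function name wait_for_download and its parameters are hypothetical and not part of the original script.

```python
import os
import time


def wait_for_download(directory, filename, timeout=60.0, poll=0.5,
                      temp_suffixes=(".crdownload", ".part")):
    """Return the path of the finished file, or raise TimeoutError.

    The file counts as finished once it exists and no file with a
    temporary download suffix is left in the directory.
    """
    target = os.path.join(directory, filename)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.isdir(directory):
            # str.endswith accepts a tuple of suffixes.
            in_progress = any(name.endswith(temp_suffixes)
                              for name in os.listdir(directory))
            if os.path.isfile(target) and not in_progress:
                return target
        time.sleep(poll)
    raise TimeoutError("download of {} did not finish within {} s".format(filename, timeout))
```

In the script it could stand in for the while loop, for example wait_for_download(dl_location, filename, timeout=120), so that a stalled download raises an error instead of hanging the job.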

rikaidekinai
  • @Jordan Lewis, this works on a Linux system. Can you please guide me on how to download a file using headless Chrome with Selenium and Python on Windows 10? – Pratibha Apr 29 '17 at 15:28
  • Freaking genius. What a nightmare I ran into thinking I could download with `phantomjs`! Out of curiosity... why do you need the `display` stuff? In my case I'm clicking a button in a local webpage that downloads a file via embedded `js` code. I commented out the `display` stuff and it still worked... – Hendy Dec 27 '17 at 00:11
  • Ah. I think I got it. I *use* `chromium` as my browser. I put this in a script and not in `jupyter` (running in `chromium`, of course!) and without the `display` code, I saw a new browser window blip open to save my file. With the `display` code, that doesn't happen. – Hendy Dec 27 '17 at 00:24
1

If the download button exposes a direct link to the file, you can test the download using Python code, because PhantomJS does not support downloads by itself. If the button does not provide a file link, you cannot test it this way.

To test using the file link and Python (to assert that the file exists), you can follow this topic. As I'm a C# developer and tester, I don't know the best way to write the code in Python without errors, but I'm sure you can:

Basic http file downloading and saving to disk in python?

Striter Alfa
  • The problem is that I don't know what really happens. The button triggers an AngularJS data-ng-click that sends the CSV. All I see recording the requests is that it triggers a REST URL, which as long as I am logged in in my browser, when loaded always sends me the CSV. Maybe the issue is that with PhantomJS/urllib I got a wrong referer URL. – rikaidekinai Jul 28 '16 at 11:53
1

I recently used Selenium with ChromeDriver to download a file from the web. This works because Chrome automatically downloads the file and stores it in the Downloads folder for you. This was easier than using PhantomJS.

I recommend looking into using ChromeDriver with Selenium and going that route: https://github.com/SeleniumHQ/selenium/wiki/ChromeDriver

EDIT - As pointed out below, I neglected to point to how to set up ChromeDriver to run in headless mode. Here's more info: http://www.chrisle.me/2013/08/running-headless-selenium-with-chrome/

Or: https://gist.github.com/chuckbutler/8030755

Maurice Reeves
  • OP needs **headless** access to target website – Andersson Jul 27 '16 at 15:02
  • @Andersson - Headless is possible with ChromeDriver and Selenium. I'll update the case to reflect that. – Maurice Reeves Jul 27 '16 at 15:29
  • I marked this as the correct answer. For me it is mostly about getting the job done. With the Chrome driver it works nicely to go to the page, login there, set the CSV parameters and click the download button. I can even set the download location. All that's left is to make it headless. Too bad people are waiting for 5 years now for the download feature in PhantomJS. – rikaidekinai Jul 28 '16 at 13:06
  • Upon quick research, running it headless in Linux would even be easier to do with `xvfb` and `PyVirtualDisplay`. This would be virtually headless then. – rikaidekinai Jul 28 '16 at 13:19
  • @rikaidekinai - I've done that as well, with `xvfb` and `PyVirtualDisplay`. I even did it on Windows 10 using Cygwin AND the new Bash for Windows, so it's cross-platform. Good luck! – Maurice Reeves Jul 28 '16 at 14:22
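
As an alternative to a virtual framebuffer, newer Chrome versions accept a --headless command-line flag. The following is only a sketch under stated assumptions: selenium is installed, chromedriver is on the PATH, and the helper names chrome_download_prefs and make_headless_chrome are hypothetical. Whether file downloads actually work in headless mode depends on the Chrome version, so this is untested against the site in the question.

```python
import os


def chrome_download_prefs(download_dir):
    """Return Chrome preferences that save downloads to download_dir without a prompt."""
    return {
        "download.default_directory": os.path.abspath(download_dir),
        "download.prompt_for_download": False,
    }


def make_headless_chrome(download_dir):
    """Start Chrome with the --headless flag (assumes chromedriver is on the PATH)."""
    # Imported lazily so chrome_download_prefs() is usable without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--window-size=800,600")
    options.add_experimental_option("prefs", chrome_download_prefs(download_dir))
    return webdriver.Chrome(options=options)
```

Since no X display is involved, the same code would in principle also run on Windows, which the Xvfb approach does not cover.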
0

You can try something like:

import requests
from requests.auth import HTTPBasicAuth

url = "http://some_site/files?file=file.csv"  # URL used to download the file
# GET request for the file content, using your website credentials
r = requests.get(url, auth=HTTPBasicAuth("your_username", "your_password"))
# Save the response body to disk; r.content is bytes, so open the file in binary mode
with open("path/to/folder/to/save/file/filename.csv", 'wb') as my_file:
    my_file.write(r.content)
Andersson
  • That unfortunately doesn't work. Still the same JSON response {"success":false,"session":"expired"}. But it was worth a try; I have other sites I fetch data from that actually offer an API with header authentication. – rikaidekinai Jul 28 '16 at 11:48
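
Regarding the expired session in the question: the urllib attempt there sends every cookie as its own HTTP header, whereas servers read cookies from a single Cookie header. A minimal sketch of that correction follows, assuming the cookie list has the shape returned by driver.get_cookies() (dicts with 'name' and 'value' keys). The helper names are hypothetical, and whether the cookies alone satisfy this particular site (it might also check the referer or user agent) is an assumption.

```python
import urllib.parse
import urllib.request


def build_cookie_header(cookies):
    """Join Selenium-style cookie dicts into one Cookie header value."""
    return "; ".join("{}={}".format(c["name"], c["value"]) for c in cookies)


def build_request(url, params, cookies, user_agent=None):
    """Build a POST request that carries the browser session's cookies."""
    data = urllib.parse.urlencode(params).encode("utf-8")
    headers = {"Cookie": build_cookie_header(cookies)}
    if user_agent:
        # Optional: mimic the browser's user agent if the site checks it.
        headers["User-Agent"] = user_agent
    return urllib.request.Request(url, data=data, headers=headers)
```

With a logged-in driver this would be used as req = build_request(url, {'foopara': 'cadbrabar'}, driver.get_cookies()), followed by urllib.request.urlopen(req) as in the question.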