Python: Unable to download with selenium in webpage

Question

My purpose it to download a zip file from https://www.shareinvestor.com/prices/price_download_zip_file.zip?type=history_all&market=bursa It is a link in this webpage https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa. Then save it into this directory "/home/vinvin/shKLSE/ (I am using pythonaywhere). Then unzip it and the csv file extract in the directory.

The code run until the end with no error but it does not downloaded. The zip file is automatically downloaded when click on https://www.shareinvestor.com/prices/price_download_zip_file.zip?type=history_all&market=bursa manually.

My code with a working username and password is used. The real username and password is used so that it is easier to understand the problem.

    #!/usr/bin/python
    print "hello from python 2"

    import urllib2
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import time
    from pyvirtualdisplay import Display
    import requests, zipfile, os    

    display = Display(visible=0, size=(800, 600))
    display.start()

    profile = webdriver.FirefoxProfile()
    profile.set_preference('browser.download.folderList', 2)
    profile.set_preference('browser.download.manager.showWhenStarting', False)
    profile.set_preference('browser.download.dir', "/home/vinvin/shKLSE/")
    profile.set_preference('browser.helperApps.neverAsk.saveToDisk', '/zip')

    for retry in range(5):
        try:
            browser = webdriver.Firefox(profile)
            print "firefox"
            break
        except:
            time.sleep(3)
    time.sleep(1)

    browser.get("https://www.shareinvestor.com/my")
    time.sleep(10)
    login_main = browser.find_element_by_xpath("//*[@href='/user/login.html']").click()
    print browser.current_url
    username = browser.find_element_by_id("sic_login_header_username")
    password = browser.find_element_by_id("sic_login_header_password")
    print "find id done"
    username.send_keys("bkcollection")
    password.send_keys("123456")
    print "log in done"
    login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
    login_attempt.submit()
    browser.get("https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa")
    print browser.current_url
    time.sleep(20)
    dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_all&market=bursa']").click()
    time.sleep(30)

    browser.close()
    browser.quit()
    display.stop()

   zip_ref = zipfile.ZipFile(/home/vinvin/sh/KLSE, 'r')
   zip_ref.extractall(/home/vinvin/sh/KLSE)
   zip_ref.close()
   os.remove(zip_ref)

HTML snippet:

<li><a href="/prices/price_download_zip_file.zip?type=history_all&amp;market=bursa">All Historical Data</a> <span>About 220 MB</span></li>

Note that &amp is shown when I copy the snippet. It was hidden from view source, so I guess it is written in JavaScript.

Observation I found

The directory home/vinvin/shKLSE do not created even I run the code with no error
I try to download a much smaller zip file which can be completed in a second but still do not download after a wait of 30s. dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_daily&date=20170519&market=bursa']").click()

Was curious to know if a Solution on Windows platform can be acceptable for you? Thanks — undetected Selenium, May 22 '17 at 16:24
Observed that you have used `folderList`, `showWhenStarting`, `download.dir` & `neverAsk.saveToDisk` in your code but you havn't mentioned about them in the description. Do you have any requirement based on those features? Thanks — undetected Selenium, May 23 '17 at 15:33
No, as long as it is downloaded in the directory in zip and unzip it. — bkcollection, May 24 '17 at 05:23

undetected Selenium · Answer 1 · 2017-12-19T15:08:45.307

I don't see any major drawback in your code block as such. But here are a few recommendations through this Solution & the execution of this Automated Test Script:

This code works perfect in Off Market Hours. During Market Hours a lot of JavaScript & Ajax Calls are in play and handling those are beyond the scope of this Question.
You may consider checking for the the intended download directory first & if not available, create a new one. That code block for this functionality is in Windows style and works perfect on Windows platform.
Once you click on "Login" induce some wait for the HTML DOM to render properly.
When you want to see off the downloading process, you need to set certain more preferences in the FirefoxProfile as mentioned in my code below.
Always consider maximizing the browser window through browser.maximize_window()
When you start downloading you need to wait for sufficient amount of time to get the file completely downloaded.
If you are using browser.quit() at the end you don't need to use browser.close()
You may consider to replace all the time.sleep() with either of ImplicitlyWait or ExplicitWait or FluentWait.

Here is your own code block with some simple tweaks in it:

#!/usr/bin/python
print "hello from python 2"

import urllib2
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from pyvirtualdisplay import Display
import requests, zipfile, os    

display = Display(visible=0, size=(800, 600))
display.start()

newpath = 'C:\\home\\vivvin\\shKLSE'
if not os.path.exists(newpath):
    os.makedirs(newpath)    

profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.dir",newpath);
profile.set_preference("browser.download.folderList",2);
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/zip");
profile.set_preference("browser.download.manager.showWhenStarting",False);
profile.set_preference("browser.helperApps.neverAsk.openFile","application/zip");
profile.set_preference("browser.helperApps.alwaysAsk.force", False);
profile.set_preference("browser.download.manager.useWindow", False);
profile.set_preference("browser.download.manager.focusWhenStarting", False);
profile.set_preference("browser.helperApps.neverAsk.openFile", "");
profile.set_preference("browser.download.manager.alertOnEXEOpen", False);
profile.set_preference("browser.download.manager.showAlertOnComplete", False);
profile.set_preference("browser.download.manager.closeWhenDone", True);
profile.set_preference("pdfjs.disabled", True);

for retry in range(5):
    try:
        browser = webdriver.Firefox(profile)
        print "firefox"
        break
    except:
        time.sleep(3)
time.sleep(1)

browser.maximize_window()
browser.get("https://www.shareinvestor.com/my")
time.sleep(10)
login_main = browser.find_element_by_xpath("//*[@href='/user/login.html']").click()
time.sleep(10)
print browser.current_url
username = browser.find_element_by_id("sic_login_header_username")
password = browser.find_element_by_id("sic_login_header_password")
print "find id done"
username.send_keys("bkcollection")
password.send_keys("123456")
print "log in done"
login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()
browser.get("https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa")
print browser.current_url
time.sleep(20)
dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_all&market=bursa']").click()
time.sleep(900)

browser.close()
browser.quit()
display.stop()

zip_ref = zipfile.ZipFile(/home/vinvin/sh/KLSE, 'r')
zip_ref.extractall(/home/vinvin/sh/KLSE)
zip_ref.close()
os.remove(zip_ref)

Let me know if this Answers your Question.

Thanks. How do you suggest to replace the `ImplicitlyWait` or `ExplicitWait `or `FluentWait.`? Make sure it is downloaded before procedd is the best idea. — bkcollection, May 26 '17 at 08:27
@bkcollection You can induce `ImplicitlyWait` once at the beginning of your script so Selenium waits consistently. You can also use `ExplicitWait` for certain elements to be visible/clickable in the HTML DOM. Alternatively you can use `FluentWait` to speedup your execution by frequent polling, ignoring the Exceptions. To `Make sure it is downloaded before procedd` you may consider to use `urllib.request.urlretrieve` module which will block until download is done. Thanks — undetected Selenium, May 26 '17 at 11:45
could you edit the answer. I do know that `time.sleep` is bad idea. Thanks — bkcollection, May 26 '17 at 12:04
@bkcollection Its all your own code. I just added the minimal tweaks required to make your code block working :) Thanks — undetected Selenium, May 26 '17 at 12:10
Could you make it better for the download part, to make sure it is downloaded completely before proceed to unzip the file? Thanks a lot — bkcollection, May 26 '17 at 12:18
@DebanjanB It worked for me. But is there a way to rename the downloaded file? — abhi1610, May 28 '18 at 11:43
@abhi1610 Can you raise a new question as per your new requirement please? But I suppose there are existing QA/discussion on this topic. Please check once. — undetected Selenium, May 28 '18 at 11:45

score 3 · Accepted Answer · answered May 23 '17 at 15:40

I rewrote your script, with comments explaining why I made the changes I made. I think your main problem might have been a bad mimetype, however, your script had a log of systemic issues that would have made it unreliable at best. This rewrite uses explicit waits, which completely removes the need to use time.sleep(), allowing it to run as fast as possible, while also eliminating errors that arise from network congestion.

You will need do the following to make sure all modules are installed:

pip install requests explicit selenium retry pyvirtualdisplay

The script:

#!/usr/bin/python

from __future__ import print_function  # Makes your code portable

import os
import glob
import zipfile
from contextlib import contextmanager

import requests
from retry import retry
from explicit import waiter, XPATH, ID
from selenium import webdriver
from pyvirtualdisplay import Display
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait

DOWNLOAD_DIR = "/tmp/shKLSE/"


def build_profile():
    profile = webdriver.FirefoxProfile()
    profile.set_preference('browser.download.folderList', 2)
    profile.set_preference('browser.download.manager.showWhenStarting', False)
    profile.set_preference('browser.download.dir', DOWNLOAD_DIR)
    # I think your `/zip` mime type was incorrect. This works for me
    profile.set_preference('browser.helperApps.neverAsk.saveToDisk',
                           'application/vnd.ms-excel,application/zip')

    return profile


# Retry is an elegant way to retry the browser creation
# Though you should narrow the scope to whatever the actual exception is you are
# retrying on
@retry(Exception, tries=5, delay=3)
@contextmanager  # This turns get_browser into a context manager
def get_browser():
    # Use a context manager with Display, so it will be closed even if an
    # exception is thrown
    profile = build_profile()
    with Display(visible=0, size=(800, 600)):
        browser = webdriver.Firefox(profile)
        print("firefox")
        try:
            yield browser
        finally:
            # Let a try/finally block manage closing the browser, even if an
            # exception is called
            browser.quit()


def main():
    print("hello from python 2")
    with get_browser() as browser:
        browser.get("https://www.shareinvestor.com/my")

        # Click the login button
        # waiter is a helper function that makes it easy to use explicit waits
        # with it you dont need to use time.sleep() calls at all
        login_xpath = '//*/div[@class="sic_logIn-bg"]/a'
        waiter.find_element(browser, login_xpath, XPATH).click()
        print(browser.current_url)

        # Log in
        username = "bkcollection"
        username_id = "sic_login_header_username"
        password = "123456"
        password_id = "sic_login_header_password"
        waiter.find_write(browser, username_id, username, by=ID)
        waiter.find_write(browser, password_id, password, by=ID, send_enter=True)

        # Wait for login process to finish by locating an element only found
        # after logging in, like the Logged In Nav
        nav_id = 'sic_loggedInNav'
        waiter.find_element(browser, nav_id, ID)

        print("log in done")

        # Load the target page
        target_url = ("https://www.shareinvestor.com/prices/price_download.html#/?"
                      "type=price_download_all_stocks_bursa")
        browser.get(target_url)
        print(browser.current_url)

        # CLick download button
        all_data_xpath = ("//*[@href='/prices/price_download_zip_file.zip?"
                          "type=history_all&market=bursa']")
        waiter.find_element(browser, all_data_xpath, XPATH).click()

        # This is a bit challenging: You need to wait until the download is complete
        # This file is 220 MB, it takes a while to complete. This method waits until
        # there is at least one file in the dir, then waits until there are no
        # filenames that end in `.part`
        # Note that is is problematic if there is already a file in the target dir. I
        # suggest looking into using the tempdir module to create a unique, temporary
        # directory for downloading every time you run your script
        print("Waiting for download to complete")
        at_least_1 = lambda x: len(x("{0}/*.zip*".format(DOWNLOAD_DIR))) > 0
        WebDriverWait(glob.glob, 300).until(at_least_1)

        no_parts = lambda x: len(x("{0}/*.part".format(DOWNLOAD_DIR))) == 0
        WebDriverWait(glob.glob, 300).until(no_parts)

        print("Download Done")

        # Now do whatever it is you need to do with the zip file
        # zip_ref = zipfile.ZipFile(DOWNLOAD_DIR, 'r')
        # zip_ref.extractall(DOWNLOAD_DIR)
        # zip_ref.close()
        # os.remove(zip_ref)

        print("Done!")


if __name__ == "__main__":
    main()

Full disclosure: I maintain the explicit module. It is designed to make using explicit waits much easier, for exactly situations like this, where websites slowly load in dynamic content based on user interactions. You could replace all of the waiter.XXX calls above with direct explicit waits.

Can you explain `@retry(Exception, tries=5, delay=3) ` and `@contextmanager `? Is there any other common module that can use to replace `retry` and `explicit`? Pythonanywhere does not have these two modules. — bkcollection, May 26 '17 at 02:22
@bkcollection, before I update my answer, can you first tell me if you can use this to install external dependencies? Specifically, `retry` and `explicit`? `https://help.pythonanywhere.com/pages/InstallingNewModules/` — Levi Noecker, May 26 '17 at 02:37
it works fine basically. But it normally stuck after `print "firefox"` and need to rerun again. It happens like 2 times out of three — bkcollection, May 26 '17 at 08:28

score 1 · Answer 3 · 2017-05-22T13:32:46.567

The reason is due to the webpage is loading slowly. I added a wait of 20 seconds after open the webpage link

login_attempt.submit()
browser.get("https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa")
print browser.current_url
time.sleep(20)
dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_all&market=bursa']").click()

It returns no error.

Additional, /zip is incorrect MIME type. Change to profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/zip')

The final correction :

   #!/usr/bin/python
    print "hello from python 2"

    import urllib2
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import time
    from pyvirtualdisplay import Display
    import requests, zipfile, os    

    display = Display(visible=0, size=(800, 600))
    display.start()

    profile = webdriver.FirefoxProfile()
    profile.set_preference('browser.download.folderList', 2)
    profile.set_preference('browser.download.manager.showWhenStarting', False)
    profile.set_preference('browser.download.dir', "/home/vinvin/shKLSE/")
    # application/zip not /zip
    profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/zip')

    for retry in range(5):
        try:
            browser = webdriver.Firefox(profile)
            print "firefox"
            break
        except:
            time.sleep(3)
    time.sleep(1)

    browser.get("https://www.shareinvestor.com/my")
    time.sleep(10)
    login_main = browser.find_element_by_xpath("//*[@href='/user/login.html']").click()
    print browser.current_url
    username = browser.find_element_by_id("sic_login_header_username")
    password = browser.find_element_by_id("sic_login_header_password")
    print "find id done"
    username.send_keys("bkcollection")
    password.send_keys("123456")
    print "log in done"
    login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
    login_attempt.submit()
    browser.get("https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa")
    print browser.current_url
    time.sleep(20)
    dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_all&market=bursa']").click()
    time.sleep(30)

    browser.close()
    browser.quit()
    display.stop()

   zip_ref = zipfile.ZipFile('/home/vinvin/shKLSE/file.zip', 'r')
   zip_ref.extractall('/home/vinvin/shKLSE')
   zip_ref.close()
   # remove with correct path
   os.remove('/home/vinvin/shKLSE/file.zip')

Thanks, it works fine with your code, with small tweaks – bkcollection May 26 '17 at 09:36 — bkcollection, May 26 '17 at 09:36

score 1 · Answer 4 · answered May 22 '17 at 05:19

1

Take it out side the scope of the selenium. Change the preference settings so that when the link is clicked (First check if link is valid) it gives you a pop up asking to save , now use sikuli http://www.sikuli.org/ to click on the popup. Mime types does not always work, and there is no black and white answer why it is not working.

answered May 22 '17 at 05:19

Subrata Fouzdar

724
5
17

As mentioned, no pop up. you try can manually with the username and password. – bkcollection May 22 '17 at 06:12

Gaurang Shah · Answer 5 · 2017-05-22T17:32:44.273

I haven't tried on the site you mentioned, however following code works perfectly and downloads the ZIP. if you are not able to download the zip, Mime type could be different. you can use chrome browser and network inspection to check the mime type of the file you are trying to download.

profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2)
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', "/home/vinvin/shKLSE/")
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/zip')

browser = webdriver.Firefox(profile)
browser.get("http://www.colorado.edu/conflict/peace/download/peace.zip")

Python: Unable to download with selenium in webpage

5 Answers5

Linked

Related