
Tools: Ubuntu, Python, Selenium, Firefox

I am trying to automate the downloading of image files from a subscription web site. I do not have access to the server other than through my paid subscription. To avoid having to click a button for each file download, I decided to automate it using Python, Selenium, and Firefox. (I have been using these three together for the first time for two days now. I also know very little about cookies.)

I am interested in downloading the following three formats, in order of preference: ['EPS', 'PNG', 'JPG']. A button for each format is available on the web site.

I have succeeded in automating the downloading of the 'PNG' and 'JPG' files to disk by setting the Firefox preferences by hand, as suggested in this post: python webcrawler downloading files

However, when the file is in an 'EPS' format, the "You have chosen to save" dialog box still pops open in the Firefox window.

As you can see from my code, I have set the preferences to save 'EPS' files to disk. (Again, 'JPG' and 'PNG' files are saved as expected.)

from selenium import webdriver

profile = webdriver.firefox.firefox_profile.FirefoxProfile()
profile.set_preference("browser.download.folderList", 1)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference('browser.helperApps.neverAsk.saveToDisk',
                       'image/jpeg,image/png,application/postscript,'
                       'application/eps,application/x-eps,image/x-eps,'
                       'image/eps')
profile.set_preference("browser.helperApps.alwaysAsk.force", False)
profile.set_preference("plugin.disable_full_page_plugin_for_types",
                       "application/eps,application/x-eps,image/x-eps,"
                       "image/eps")
profile.set_preference(
    "general.useragent.override",
    "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:26.0)"
    " Gecko/20100101 Firefox/26.0")
driver = webdriver.Firefox(firefox_profile=profile)

#I then log in and begin automated clicking to download files. 'JPG' and 'PNG' files are
#saved to disk as expected. The 'EPS' files present a save dialog box in Firefox.

I tried installing an extension for Firefox called "download-statusbar" that claims to negate any save dialog box from appearing. The extension loads in the Selenium Firefox browser, but it doesn't function. (A lot of reviews say the extension is broken despite the developers' insistence that it does function.) It isn't working for me anyway so I gave up on it.

I added this to the Firefox profile in that attempt:

#The extension loads, but it doesn't function.
# Parentheses let the two string literals join across lines.
download_statusbar = ('/home/$USER/Downloads/'
                      'download_statusbar_fixed-1.2.00-fx.xpi')
profile.add_extension(download_statusbar)

From reading other stackoverflow.com posts, I decided to see if I could download the file via the url with urllib2. As I understand it, I would need to add cookies to the headers in order to authenticate the download of the 'EPS' file via a url.

I am unfamiliar with this technique, but here is the code I tried to use to download the file directly. It failed with a '403 Forbidden' response despite my attempts to set cookies in the urllib2 opener.

import urllib2
import cookielib
import logging
import sys

cookie_jar = cookielib.LWPCookieJar()
handlers = [
    urllib2.HTTPHandler(),
    urllib2.HTTPSHandler(),
]
for h in handlers:
    h.set_http_debuglevel(1)
handlers.append(urllib2.HTTPCookieProcessor(cookie_jar))
#using selenium driver cookies, returns a list of dictionaries
cookies = driver.get_cookies()
opener = urllib2.build_opener(*handlers)
opener.addheaders = [(
    'User-agent',
    'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:26.0) '
    'Gecko/20100101 Firefox/26.0'
)]
logger = logging.getLogger("cookielib")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.DEBUG)
for item in cookies:
    opener.addheaders.append(('Cookie', '{}={}'.format(
        item['name'], item['value']
    )))
    logger.info('{}={}'.format(item['name'], item['value']))
response = opener.open('http://path/to/file.eps')
#Fails with a 403 Forbidden response
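
A note on the loop above: as far as I can tell, urllib2 only applies an entry from `opener.addheaders` when the request does not already carry a header of that name, so appending one `('Cookie', ...)` tuple per cookie may result in only the first cookie being sent. A minimal sketch of the usual alternative, joining every cookie into a single `Cookie` header value, is below. It is written for Python 3, the helper name `build_cookie_header` and the cookie names and values are hypothetical, and nothing in this thread confirms that this alone would cure the 403 (the server may also check other headers).

```python
def build_cookie_header(cookies):
    """Join Selenium-style cookie dicts into one Cookie header value.

    `cookies` is assumed to have the shape returned by
    driver.get_cookies(): a list of dicts with 'name' and 'value' keys.
    """
    return "; ".join(
        "{}={}".format(c["name"], c["value"]) for c in cookies
    )


# Hypothetical cookies, standing in for driver.get_cookies().
example_cookies = [
    {"name": "sessionid", "value": "abc123"},
    {"name": "csrftoken", "value": "xyz789"},
]

cookie_header = build_cookie_header(example_cookies)
print(cookie_header)
```

The resulting string would then be added once, for example as a single `('Cookie', cookie_header)` entry in `opener.addheaders`.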

Any thoughts or suggestions? Am I missing something easy or do I need to give up hope on an automated download of the EPS files? Thanks in advance.

dmmfll
  • You might want to use the method described [here](http://watirmelon.com/2011/09/07/determining-file-mime-types-to-autosave-using-firefox-watir-webdriver/) to check that the MIME type for the eps files is correct. – unutbu Jan 24 '14 at 20:43
  • @unutbu Thanks for the promising lead. I checked out your suggested link and tried downloading an EPS file manually. Strangely, Firefox gives me the opportunity to set a default behavior for JPG files as shown [in this screenshot](https://www.dropbox.com/s/7exk8h7yevfnaco/Screenshot%202014-01-24%2015.59.00.png) but the opportunity to set a default behavior for an EPS file is absent as shown [in this screenshot](https://www.dropbox.com/s/medsnjmgqri7w6r/Screenshot%202014-01-24%2015.59.12.png). I set the prefs in Firefox to "always ask me to save files". Maybe I need an EPS-capable app installed. – dmmfll Jan 24 '14 at 21:03
  • By opening an EPS file locally with the File>Open menu in Firefox I was able to get an expected dialog box with an option to "Do this automatically for files like this from now on." [screenshot](https://www.dropbox.com/s/ahslfma7r4n6x9e/Screenshot%202014-01-24%2016.25.12.png) Weird that it didn't happen on an EPS file download. – dmmfll Jan 24 '14 at 21:26
  • I now have an entry 'NC value="image/x-eps"' in the mimeTypes.rdf, but of course it was set by my locally opening a file. Is there another way to see what MIME type the server is sending when I attempt a download? – dmmfll Jan 24 '14 at 21:38

1 Answer


Thank you to @unutbu for helping me solve this. I just didn't understand the anatomy of a file download. I do understand a little bit better now.

I ended up installing an extension called "Live HTTP Headers" on Firefox to examine the headers sent by the server. As it turned out, the 'EPS' files were sent with a 'Content-Type' of 'application/octet-stream'.
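
To illustrate why the original preference list did not help, here is a small sketch that checks a `Content-Type` header value against a comma-separated `neverAsk.saveToDisk` string. It assumes Firefox matches the MIME type sent by the server against the entries in that list; the helper names `mime_type` and `is_auto_saved` are hypothetical, and the two preference strings are the ones from the question and from this answer.

```python
def mime_type(content_type_header):
    """Strip parameters such as '; charset=...' from a Content-Type value."""
    return content_type_header.split(";", 1)[0].strip().lower()


def is_auto_saved(content_type_header, never_ask_pref):
    """Return True if the header's MIME type appears in the preference list."""
    allowed = {
        m.strip().lower() for m in never_ask_pref.split(",") if m.strip()
    }
    return mime_type(content_type_header) in allowed


# Preference value from the question (EPS-specific types only).
original_pref = (
    "image/jpeg,image/png,application/postscript,"
    "application/eps,application/x-eps,image/x-eps,"
    "image/eps"
)
# Preference value from this answer.
fixed_pref = "image/jpeg,image/png,application/octet-stream"

print(is_auto_saved("application/octet-stream", original_pref))  # False
print(is_auto_saved("application/octet-stream", fixed_pref))     # True
```

None of the EPS-specific types in the original list equals 'application/octet-stream', which is consistent with the save dialog still appearing for those files.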

Now the EPS files are saved to disk as expected. I modified the Firefox preferences to the following:

profile.set_preference('browser.helperApps.neverAsk.saveToDisk',
                       'image/jpeg,image/png,'
                       'application/octet-stream')
dmmfll