
I'm trying to download large files (approx. 1 GB) with the mechanize module, but I haven't been successful. I've searched for similar threads, but found only ones where the files are publicly accessible and no login is required to obtain them. That isn't my case: the file sits in a private section and I need to log in before downloading. Here is what I've done so far.

import mechanize

g_form_id = ""

def is_form_found(form1):
    return "id" in form1.attrs and form1.attrs['id'] == g_form_id

def select_form_with_id_using_br(br1, id1):
    global g_form_id
    g_form_id = id1
    try:
        br1.select_form(predicate=is_form_found)
    except mechanize.FormNotFoundError:
        print "form not found, id: " + g_form_id
        exit()

url_to_login = "https://example.com/"
url_to_file = "https://example.com/download/files/filename=fname.exe"
local_filename = "fname.exe"

br = mechanize.Browser()
br.set_handle_robots(False)   # ignore robots
br.set_handle_refresh(False)  # can sometimes hang without this
br.addheaders = [('User-agent', 'Firefox')]

response = br.open(url_to_login)
# Find login form
select_form_with_id_using_br(br, 'login-form')
# Fill in data
br.form['email'] = 'email@domain.com'
br.form['password'] = 'password'
br.set_all_readonly(False)    # allow everything to be written to
br.submit()

# Try to download file
br.retrieve(url_to_file, local_filename)

But I get an error once 512 MB has been downloaded:

Traceback (most recent call last):
  File "dl.py", line 34, in <module>
    br.retrieve(url_to_file, local_filename)
  File "C:\Python27\lib\site-packages\mechanize\_opener.py", line 277, in retrieve
    block = fp.read(bs)
  File "C:\Python27\lib\site-packages\mechanize\_response.py", line 199, in read
    self.__cache.write(data)
MemoryError: out of memory

Do you have any ideas how to solve this? Thanks

Milan Skála

2 Answers


You can use bs4 and requests to log in, then write the streamed content to disk. Several form fields are required, including a _token_ field that is definitely necessary:

from bs4 import BeautifulSoup
import requests
from urlparse import urljoin

data = {'email': 'email@domain.com', 'password': 'password'}
base = "https://support.codasip.com"

with requests.Session() as s:
    # update headers
    s.headers.update({'User-agent': 'Firefox'})

    # use bs4 to parse the form fields
    soup = BeautifulSoup(s.get(base).content, "html.parser")
    form = soup.select_one("#frm-loginForm")
    # works as it is a relative path. Not always the case.
    action = form["action"]

    # Get rest of the fields, ignore password and email.
    for inp in form.find_all("input", {"name":True,"value":True}):
        name, value = inp["name"], inp["value"]
        if name not in data:
            data[name] = value
    # login
    s.post(urljoin(base, action), data=data)
    # get protected url
    with open(local_filename, "wb") as f:
        for chk in s.get(url_to_file, stream=True).iter_content(1024):
            f.write(chk)
Padraic Cunningham
  • Thank you for your answer, but it seems that logging in doesn't work, probably because the form entries could not be found properly. I will try to figure this out. – Milan Skála Sep 29 '16 at 11:05
  • @MilanSkála, actually we need a token, I will edit in a minute – Padraic Cunningham Sep 29 '16 at 11:36
  • Hmm, weird, calling soup.select_one() method raises TypeError: 'NoneType' object is not callable. – Milan Skála Sep 29 '16 at 11:53
  • @MilanSkála, I ran the code to verify and it does what it is supposed to, are you passing the base url as I have above? https://support.codasip.com? – Padraic Cunningham Sep 29 '16 at 11:55
  • Yeah, I used your code 1:1. May this be caused by different version of bs4 module? I'm using 4.3.2. Btw, method 'select' works, but returns list, where the first item is html of the form (as string). – Milan Skála Sep 29 '16 at 12:02
  • I would definitely suggest you update, that version is almost 3 years old. You could just use index what select returns but really I would update if it were me. – Padraic Cunningham Sep 29 '16 at 12:03
  • Nice, it works as I desired. Thank you sir, you are the master. I really appreciate your time :) – Milan Skála Sep 29 '16 at 12:10
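As the comments note, select_one() was only added in bs4 4.4, which is why the older 4.3.2 raises TypeError. On an old version, a fallback sketch (using a minimal stand-in page here, not the real site's HTML) could be:

```python
from bs4 import BeautifulSoup

# Minimal stand-in page; in the answer above this would be s.get(base).content
html = '<form id="frm-loginForm" action="/login"><input name="_token" value="abc"></form>'
soup = BeautifulSoup(html, "html.parser")

# select() exists in older bs4 releases too and returns a list,
# so indexing the first match mirrors what select_one() does in 4.4+
form = soup.select("#frm-loginForm")[0]
action = form["action"]
```

Updating bs4, as suggested above, is still the cleaner fix.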

Try downloading and writing the file in chunks. It seems the whole file is being held in memory.
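A rough sketch of what that chunked loop could look like. The io.BytesIO object below is only a stand-in for the response (both expose .read(n)); with a real browser you would pass the result of br.open(url_to_file) instead. Note that the traceback above shows mechanize's response wrapper caching everything it reads (self.__cache.write), so with mechanize you may still hit the MemoryError and need to switch to requests as in the other answer.

```python
import io

def save_in_chunks(resp, path, block_size=8192):
    # Copy the response to disk in fixed-size blocks so the whole
    # file never has to sit in memory at once.
    with open(path, "wb") as out:
        while True:
            block = resp.read(block_size)
            if not block:
                break
            out.write(block)

# Stand-in for a response object (hypothetical data); a real call would be:
#   resp = br.open(url_to_file)
#   save_in_chunks(resp, local_filename)
fake_resp = io.BytesIO(b"x" * 20000)
save_in_chunks(fake_resp, "fname.part")
```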

You should also specify the Range header in your request if the server supports it, so an interrupted download can be resumed instead of restarted.

https://en.wikipedia.org/wiki/List_of_HTTP_header_fields
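To build on that, here is a hedged sketch of resuming with requests (the library used in the other answer). The partial file is simulated here; the actual request is left as a comment since it needs the real session and URL from the question.

```python
import os

partial = "fname.exe"  # file name taken from the question

# Simulate an interrupted download (hypothetical partial content)
with open(partial, "wb") as f:
    f.write(b"\x00" * 512)

# Ask the server to resume from where we stopped. Servers that support
# ranges answer 206 Partial Content; others resend the whole file.
resume_from = os.path.getsize(partial)
headers = {"Range": "bytes=%d-" % resume_from}

# With a live requests.Session this would be roughly:
#   r = s.get(url_to_file, headers=headers, stream=True)
#   with open(partial, "ab") as f:   # append the remaining bytes
#       for chunk in r.iter_content(8192):
#           f.write(chunk)
```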

Alex