Download images and pdf using python (robobrowser)

Question

I am using RoboBrowser to log in to a password-protected website. I am able to download the HTML and edit it. However, when I use the following method to download a PDF:

br = RoboBrowser(history=True)
url = 'https://dummywebsite.html/dummy.pdf'
br.open(url)
pdf_file = '/localdir/local.pdf'
with open(pdf_file, 'wb') as output:
    output.write("%s" % (br.parsed))

The output is not a valid PDF file. The same happens when I try to download images. I have gone through the documentation but couldn't find anything yet. The alternative seems to be mechanize, but it has no Python 3 support.

I would be grateful for any help or pointers. An alternative library, in case RoboBrowser cannot handle this, would also be a great help.

Can you log in with SimpleAuth? `http://login:password@url`? If so, then you could use the `urllib2` module. — Jimilian, Feb 17 '15 at 11:17
I think `br.parsed` is probably not what you want, as the documentation says that this returns "... parse[d] response content". Perhaps RoboBrowser supports a way to read the raw HTTP response body, or you might have an easier time with `urllib2` if you can handle the authentication. — Josh Kupershmidt, Feb 17 '15 at 14:52

score 2 · Answer 1 · answered Mar 17 '15 at 12:28

You could try using the `requests` session object that RoboBrowser exposes as `browser.session`:

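from robobrowser import RoboBrowser

url = "https://dummywebsite.html/dummy.pdf"
pdf_file_path = "/localdir/local.pdf"

browser = RoboBrowser(history=True)
# do the login (e.g. via a login form)

# browser.session is the underlying requests session, so it carries the login cookies
response = browser.session.get(url, stream=True)

with open(pdf_file_path, "wb") as pdf_file:
    # .content is the raw response body as bytes, not parsed HTML
    pdf_file.write(response.content)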

This method also allows you to access files that are only available after you are logged in (this information is usually stored in the HTTP session).
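
For large files, one option is to write the body in chunks rather than reading it all through `.content`. The following is only a minimal sketch: `download_file` and its `chunk_size` default are hypothetical names chosen for illustration, and `session` is assumed to be the `requests` session of an already logged-in browser (for example `browser.session`).

```python
def download_file(session, url, dest_path, chunk_size=8192):
    """Stream `url` to `dest_path` in chunks (hypothetical helper).

    `session` is assumed to be a requests session that is already
    authenticated, e.g. `browser.session` after the login step.
    """
    with session.get(url, stream=True) as response:
        # Raise on 4xx/5xx instead of silently saving an error page.
        response.raise_for_status()
        with open(dest_path, "wb") as output:
            # iter_content yields the body piece by piece as bytes.
            for chunk in response.iter_content(chunk_size=chunk_size):
                output.write(chunk)
    return dest_path


# Usage, after logging in with RoboBrowser:
# download_file(browser.session, url, pdf_file_path)
```

With `stream=True` the body is fetched only as `iter_content` is consumed, so the whole file does not need to fit in memory at once.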

Not sure the `stream=True` option is working there (if the file is small it has no effect, and if the file is large the code may fail). — Jean Paul, Jan 27 '21 at 18:10

score 1 · Answer 2 · answered Oct 06 '16 at 17:13

You have to write the raw content of the response (the PDF bytes) to the file, not the parsed page. This code should work:

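from robobrowser import RoboBrowser

br = RoboBrowser(history=True)
url = 'https://dummywebsite.html/dummy.pdf'
br.open(url)
pdf_file = '/localdir/local.pdf'

# br.response is the raw HTTP response; .content is its body as bytes
content = br.response.content

with open(pdf_file, "wb") as output:
    output.write(content)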
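
If the saved file still does not open, it may help to check what was actually downloaded, since a failed login often returns an HTML page instead of the file. The sketch below compares the first bytes against a few well-known file signatures. `sniff_file_type` is a hypothetical helper written for illustration, not part of RoboBrowser.

```python
def sniff_file_type(data):
    """Guess a file type from its leading bytes (hypothetical helper)."""
    signatures = {
        b"%PDF-": "pdf",
        b"\x89PNG\r\n\x1a\n": "png",
        b"\xff\xd8\xff": "jpeg",
        b"GIF87a": "gif",
        b"GIF89a": "gif",
    }
    for magic, kind in signatures.items():
        if data.startswith(magic):
            return kind
    # Unknown signature, e.g. an HTML login page served instead of the file.
    return None


# Usage with the code above:
# if sniff_file_type(br.response.content) != "pdf":
#     print("The response does not look like a PDF")
```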
