I am using RoboBrowser to log in to a password-protected website. I am able to download the HTML and edit it. However, when I try to download a PDF with the following code:

br = RoboBrowser(history=True)
url = 'https://dummywebsite.html/dummy.pdf'
br.open(url)
pdf_file = '/localdir/local.pdf'
with open(pdf_file, 'wb') as output:
    output.write("%s" % (br.parsed))

The output is not a valid PDF file. The same happens when I try to download images. I have gone through the documentation but haven't found anything yet. The alternative seems to be mechanize, but it has no Python 3 support.

I would be grateful for any help or pointers. If RoboBrowser cannot handle this, suggestions for an alternative would also be a great help.

ROMANIA_engineer
user984201
  • Can you log in with SimpleAuth? `http://login:password@url`? If so, then you could use the `urllib2` module. – Jimilian Feb 17 '15 at 11:17
  • I think `br.parsed` is probably not what you want, as the documentation says that this returns "... parse[d] response content". Perhaps RoboBrowser supports a way to read the raw HTTP response body, or you might have an easier time with `urllib2` if you can handle the authentication. – Josh Kupershmidt Feb 17 '15 at 14:52
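
To illustrate the point about `br.parsed`: a PDF is binary data, and pushing binary data through a text representation is generally lossy. The sketch below uses only the standard library and made-up sample bytes. It mimics a lossy decode and does not reproduce what RoboBrowser's parser actually does.

```python
# Made-up sample: a PDF header followed by a few non-text bytes.
raw = b"%PDF-1.4\n\xe2\xe3\xcf\xd3\n"

# Treating the bytes as text replaces anything that is not valid UTF-8.
as_text = raw.decode("utf-8", errors="replace")

# Turning the text back into bytes does not restore the original data.
round_tripped = as_text.encode("utf-8")

print(round_tripped == raw)  # False
```

This is why the raw response bytes, rather than a parsed or stringified page, need to be written to the file.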

2 Answers

You could try the `requests` session object that RoboBrowser exposes as `browser.session`:

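from robobrowser import RoboBrowser

url = "https://dummywebsite.html/dummy.pdf"
pdf_file_path = "/localdir/local.pdf"

browser = RoboBrowser(history=True)
# do the login (e.g. via a login form)

# browser.session is the underlying requests session, so it carries the login state
response = browser.session.get(url, stream=True)

# response.content is the raw body as bytes, not parsed HTML
with open(pdf_file_path, "wb") as pdf_file:
    pdf_file.write(response.content)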

This approach also lets you download files that are only available after you are logged in, because the login state is usually stored in the HTTP session.

  • Not sure the `stream=True` option is working there (if the file is small it has no effect, and if the file is large the code may fail). – Jean Paul Jan 27 '21 at 18:10
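
If the concern about `stream=True` matters for large files, one option is to write the body in chunks rather than reading `response.content` all at once. This is a minimal sketch, assuming `session` is a requests-style session such as `browser.session`. The names `download_file` and `chunk_size` are hypothetical and not part of RoboBrowser.

```python
from contextlib import closing


def download_file(session, url, path, chunk_size=8192):
    """Stream the body of `url` to `path` in chunks; return bytes written."""
    written = 0
    # stream=True defers downloading the body until it is iterated.
    with closing(session.get(url, stream=True)) as response:
        # Raise on 4xx/5xx instead of saving an error page.
        response.raise_for_status()
        with open(path, "wb") as output:
            for chunk in response.iter_content(chunk_size=chunk_size):
                if chunk:  # skip empty chunks
                    output.write(chunk)
                    written += len(chunk)
    return written
```

With the names from the answer above, this would be called as `download_file(browser.session, url, pdf_file_path)` after logging in.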

You have to write the raw content of the response (the PDF bytes), not the parsed page, to the file. This code should work:

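from robobrowser import RoboBrowser

br = RoboBrowser(history=True)
url = 'https://dummywebsite.html/dummy.pdf'
br.open(url)
pdf_file = '/localdir/local.pdf'

# br.response is the underlying HTTP response; .content is its raw bytes
content = br.response.content

with open(pdf_file, "wb") as output:
    output.write(content)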
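
A quick way to confirm that the saved file really is a PDF is to look at its first bytes, since well-formed PDF files begin with `%PDF-`. The helper below is a hypothetical sketch using only the standard library. If it returns `False`, the server may have sent something else, for example an HTML page.

```python
def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF magic bytes."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```

With the variable from the code above, the check would be `looks_like_pdf(pdf_file)`.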
burgund