I'm trying to automate downloading images from a website that requires a login. The login form is on every page (in the browser you click "login" and a JavaScript slide-down reveals the form). I log in using the code below, but when I get to agent.get("http://cdn.com/some_image.jpg"), a 403 error is thrown. This doesn't happen when I log in with the browser and visit "http://cdn.com/some_image.jpg", so what is going on and how can I get around it?

require "mechanize"

path = "http://www.example.com/some_path"

agent = Mechanize.new

# Load a page that carries the login form, fill it in, and submit it.
agent.get(path) do |page|
  form = page.form_with(action: "http://www.example.com/authorize")
  form.field_with(name: "username").value = "some_user"
  form.field_with(name: "password").value = "password"
  form.submit
end

# This is the request that fails with a 403.
agent.get("http://cdn.com/some_image.jpg").save("some_image.jpg") unless File.exist?("some_image.jpg")
Nona

2 Answers

Think about this: you submitted a login request, and then a request for the image. How does the server know that you are the person who logged in with the first request? Tracking by IP (could be shared or a proxy), port (wouldn't typically survive multiple requests), user agent (not unique), etc. obviously wouldn't work. Typically, login sessions are implemented using cookies: a web client is given a session token in the form of a cookie, which, when presented back to the server in a subsequent request, tells the server which session the request belongs to. This lets the server track logins across what are otherwise stateless web requests.

There are other methods, but they mostly revolve around passing this token in another way (custom header, GET URL parameters, etc.), with the notable exception of signed web requests such as those AWS uses (cool, but not very common for web logins). All in all, session cookies are by far the most common implementation.

Thus, I suggest you take a look at the following post, as there seems to be a way of managing cookies within the Mechanize gem for use with subsequent requests:

Maintaining cookies between Mechanize requests
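
The cookie round trip described in this answer can be sketched in a few lines of plain Ruby. This is only an illustration of the idea, not how Mechanize implements its cookie jar: the TinyCookieJar class, the session_id cookie name, and the header value are all hypothetical, and attributes such as Domain, Path, and Expires are ignored.

```ruby
# Hypothetical, minimal cookie jar: illustrates the session-cookie round trip only.
# It ignores attributes such as Domain, Path, and Expires that real clients honor.
class TinyCookieJar
  def initialize
    @cookies = {}
  end

  # Record the name=value pair from one Set-Cookie response header.
  def store(set_cookie_header)
    pair = set_cookie_header.split(";").first.to_s
    name, value = pair.split("=", 2)
    @cookies[name.strip] = value.to_s.strip unless name.nil? || name.strip.empty?
    self
  end

  # Build the Cookie request header to send with the next request.
  def header
    @cookies.map { |name, value| "#{name}=#{value}" }.join("; ")
  end
end

jar = TinyCookieJar.new
jar.store("session_id=abc123; Path=/; HttpOnly") # made-up login response header
puts jar.header
```

A real client also has to decide which cookies apply to which host. That matters here, because a cookie issued by www.example.com would not normally be sent to cdn.com.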

erik258

Since it's a CDN, I would guess they're checking the User-Agent or Referer header.

Mechanize should be setting the referer properly, so that leaves user-agent.
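
To check what a request with browser-like headers looks like, here is a minimal sketch. It uses Ruby's standard Net::HTTP rather than Mechanize so that it runs without the gem, and it only builds the request without sending it. The build_image_request helper and both header values are placeholders, not anything the site is known to require.

```ruby
require "net/http"
require "uri"

# Build (but do not send) a GET request that carries browser-like headers.
# Both header values are supplied by the caller; copy the real ones from the browser.
def build_image_request(url, user_agent:, referer:)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  request["User-Agent"] = user_agent
  request["Referer"] = referer
  request
end

request = build_image_request(
  "http://cdn.com/some_image.jpg",
  user_agent: "Mozilla/5.0 (placeholder browser string)", # assumption: replace with your browser's value
  referer: "http://www.example.com/some_path"
)

# Print every header that would be sent with the request.
request.each_header { |name, value| puts "#{name}: #{value}" }
```

Comparing this output with the request headers shown in the browser's developer tools should reveal which header, if any, differs.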

pguardiario
  • But my user agent is set. Would they also be blocking certain browsers? – Nona Jan 03 '15 at 00:39
  • Good point, you may be right. People have a tendency to munge relevant data when sanitizing their URLs; otherwise, it would be interesting to know how www.example.com could authorize a request to cdn.com. – erik258 Jan 03 '15 at 00:39
  • Try setting user-agent and referer to the same headers your browser sends and see. – pguardiario Jan 03 '15 at 00:43
  • I set the user-agent to the same header my browser sends. Also, no referer seems to be set (using the HTTPTrace Chrome plugin), but 3 cookies (which cannot be found in agent.cookies for Mechanize) are passed along in the browser request and not in the Mechanize request. *Cookie: _gat=1; __asc=cxxxx93714xxd2baac454f6888c; __auc=4da96ae114aacd6123456e8e8b4; _ga=GA1.2.141234142.1123438953* What does this mean? I see the cookies in my browser, but can't access them via the Mechanize *agent* variable. – Nona Jan 03 '15 at 00:55
  • In that same browser request to *http://cdn.com/some_image.jpg* that receives a 200 response, the browser also makes two requests to *http://cdn.com/favicon.ico* and receives a 403 response each time. – Nona Jan 03 '15 at 01:18
  • You can ignore the favicon.ico request. It sounds like some JavaScript is running to set a cookie with the CDN. Mechanize doesn't execute JavaScript, so it might be best to switch to something that does, for example watir-webdriver. – pguardiario Jan 03 '15 at 01:25
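
One way to test whether the cookies from the comments above matter is to copy them from the browser and send them by hand. The sketch below does this with Ruby's standard Net::HTTP instead of Mechanize, and only builds the request. The cookie_header helper is hypothetical, and the cookie values are placeholders to be replaced with whatever your browser actually sends.

```ruby
require "net/http"
require "uri"

# Join name/value pairs into a single Cookie header value.
def cookie_header(cookies)
  cookies.map { |name, value| "#{name}=#{value}" }.join("; ")
end

# Placeholder values: substitute the cookies your browser actually sends.
browser_cookies = { "_gat" => "1", "__asc" => "placeholder" }

request = Net::HTTP::Get.new(URI.parse("http://cdn.com/some_image.jpg"))
request["Cookie"] = cookie_header(browser_cookies)
puts request["Cookie"]
```

If the image downloads with these cookies and fails without them, the cookies are what the CDN checks. If it still fails, the cause lies elsewhere and a JavaScript-capable browser driver may be the more practical route.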