
I need to be able to grab reports from a particular website. The method below does everything I need it to do. The only catch is that the report, "report.csv", is served back with "content-disposition:filename=report.csv" in the response header when the page is posted (the page posts to itself).

require "date"

def download_report
  # Open the reporting page and select the "adperf" report type.
  page = @mechanize.click(@mechanize.current_page().link_with(:text => /Reporting/))
  page.form.field_with(:name => "rep").option_with(:value => "adperf").click

  # Switch the date range to "Custom" so the start/end fields are honoured.
  form = page.form_with(:name => "get-report")
  form.field_with(:id => "sasReportingQuery.dateRange").option_with(:value => "Custom").click

  start_date = DateTime.parse(@start_date)
  end_date = DateTime.parse(@end_date)

  form.field_with(:name => "sd_display").value = start_date.strftime("%m/%d/%Y")
  form.field_with(:name => "ed_display").value = end_date.strftime("%m/%d/%Y")

  # The value of the submit is the method's return value.
  form.submit
end

As far as I can tell, Mechanize is not capturing the file anywhere that I can get to it. Is there a way to get Mechanize to capture and download this file?

`@mechanize.current_page()` does not contain the file, and `@mechanize.history()` does not show that the file URL was ever presented to Mechanize.

Bill Watts
  • Your method doesn't return anything. Remove the last `p page` and it will return the response from the submit. – pguardiario Sep 19 '12 at 23:06
  • This is just an example, not the real code. Mechanize still just returns the posted page. `p page` is there to confirm that this is the case. – Bill Watts Sep 20 '12 at 18:36
  • It doesn't confirm that at all though because page is not the returned value. – pguardiario Sep 20 '12 at 23:21
  • `p page` shows me the value of `page` in the console, and it shows that the current page Mechanize holds is the page that has been posted to, not a file. Going back through Mechanize's history also shows no file. – Bill Watts Sep 21 '12 at 00:24
  • No, page is the page with the form on it. The current page would be @mechanize.page – pguardiario Sep 21 '12 at 00:48
  • Yes, you're right, `page` does return the page that the form is on. I removed it from my example and made the question a little clearer. Sorry for the confusion. However, `@mechanize.current_page()` and `@mechanize.history()` show no trace of a file (report.csv). – Bill Watts Sep 21 '12 at 01:32
  • Ok, now you can do: `file = download_report` – pguardiario Sep 21 '12 at 01:55
  • If I just return `page` and call my method, it does return a file, yes. But that file is just the HTML of `page`, not the CSV file that I need. When the form is posted, they are using either JavaScript or some other kind of content delivery. It has been discussed below that Mechanize will never see the file when it is delivered in this manner. – Bill Watts Sep 21 '12 at 13:39
  • Yes but that's a separate issue from the one I pointed out. – pguardiario Sep 21 '12 at 22:55

2 Answers


The server appears to be telling the browser to save the document. "Content-disposition:filename" is the clue to that. Mechanize won't know what to do with that, and will try to read and parse the content, which, if it's a CSV, will not work.
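As a rough illustration of reading that clue, here is a minimal sketch that pulls the suggested file name out of a Content-Disposition header value. The helper name `filename_from_content_disposition` is hypothetical and not part of Mechanize; it only handles the plain `filename=` form (quoted or bare) and ignores the encoded `filename*=` variant.

```ruby
# Hypothetical helper (not part of Mechanize): return the file name
# suggested by a Content-Disposition header value, or nil if none is given.
def filename_from_content_disposition(header)
  return nil if header.nil?

  # Accept quoted and bare values, with or without a leading
  # disposition type such as "attachment;".
  match = header.match(/filename\s*=\s*(?:"([^"]*)"|([^;\s]+))/i)
  return nil unless match

  match[1] || match[2]
end

puts filename_from_content_disposition("filename=report.csv")
puts filename_from_content_disposition('attachment; filename="report.csv"')
```

If this returns a name for the response to the form post, the body of that same response is the file, and there may be no separate URL to look for.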

Without seeing the HTML page you're working with it's impossible to know exactly what mechanism they're using to trigger the download. Clicking an element could fire a JavaScript event, which Mechanize won't handle. Or, it could send a form to the server, which responds with the document download. In either case, you have to figure out what is being sent, why, and what specifically defines the document you want, then use that information to request the document.

Mechanize isn't the right tool to download an attachment. Use Mechanize to navigate forms, then use Mechanize's embedded Nokogiri to extract the URL for the document.

Then use something like curb or Ruby's built-in OpenURI to retrieve the attachment, or see "Using WWW:Mechanize to download a file to disk without loading it all in memory first" for more information.
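A minimal sketch of that last step using Ruby's standard Net::HTTP, for the case where the document comes back as the response to a form post. Everything here is an assumption about the target site: the helper names `build_report_request` and `fetch_report` are hypothetical, the URL is a placeholder, the field names are taken from the question, and the `cookie` string stands in for whatever session the site requires. A real form may also need hidden fields that are not shown.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: build the POST that the report form would send.
# `fields` are the form inputs; `cookie` is a session cookie string
# (an assumption: how you obtain it depends on your login flow).
def build_report_request(url, fields, cookie = nil)
  request = Net::HTTP::Post.new(URI(url))
  request.set_form_data(fields)
  request["Cookie"] = cookie if cookie
  request
end

# Hypothetical helper: send the request and write the body to `dest`,
# but only when the server flags the response as a file via
# Content-Disposition. Returns `dest` on success, nil otherwise.
def fetch_report(url, fields, dest, cookie = nil)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(build_report_request(url, fields, cookie))
  end
  return nil unless response.is_a?(Net::HTTPSuccess) && response["Content-Disposition"]

  File.binwrite(dest, response.body)
  dest
end

# Example call (placeholder URL, so it is left commented out):
# fetch_report("https://example.com/reports",
#              { "rep" => "adperf", "sd_display" => "09/01/2012", "ed_display" => "09/19/2012" },
#              "report.csv", "session=...")
```

Building the request separately from sending it makes it easy to inspect exactly what will be posted before any network traffic happens.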

the Tin Man
  • What if there is no URL to the document? I've seen cases where the end provider spits back the content with that header, and then starts the output of the HTML page. – lazyPower Sep 19 '12 at 19:00
  • Mechanize will still be in the way. In those cases you'd have to write code to handle the specific situation. Accessing some URL has to trigger the download, it just doesn't automatically start, so that's the part that has to be performed by something besides Mechanize. – the Tin Man Sep 19 '12 at 19:02
  • OK, Mechanize won't follow the JS, that makes sense. Even more so since it's some JS voodoo that seems to be the problem in question. So let me throw a wrench in this. The page in question is the reports page in LinkedIn ads. When you submit the form, the request comes in with the application/csv type but a status of cancelled (in Chrome). If you dive into the request, however, the status is 200 and there is a post response that you would normally see. So they are doing something via JS to produce their reports, because there is no link to the report passed back in the post. Ideas? – Bill Watts Sep 19 '12 at 19:48
  • If you need to process JS, use Watir, or one of its spinoffs. That way you can have access to a running browser, which will handle the JavaScript for you. Or, manually step through the JavaScript using Firebug and figure out what URL is being sent. – the Tin Man Sep 19 '12 at 20:06
  • Watir could be an option; the only problem is that the method posted above lives in a Sinatra app and accepts connections from other services. No browser is available. – Bill Watts Sep 20 '12 at 13:20
  • After talking with a few people about this situation, one person suggested that the file may be served using an `X-Sendfile` or `X-Accel-Redirect` response. – Bill Watts Sep 20 '12 at 18:41

Check the class of the returned page with `page.class`. If it is `Mechanize::File` rather than `Mechanize::Page`, you can just save it.

...
page = page.form_with(:name => "get-report").submit
page.class # Mechanize::File?
page.save('path/to/file')
Headshota