Mechanize Rails - Web Scraping - Server responds with JSON - How to Parse the URL from It to Download a CSV

Question

I am new to Mechanize and am stuck on a problem that probably has a very obvious answer.

I put together a short script to auth on an external site, then click a link that generates a CSV file dynamically.

I have finally got it to click the export button; however, it returns a JSON response containing an AWS URL.

I'm trying to get the script to download the CSV from the URL in this JSON response (shown below).

Myscript.rb

    require 'mechanize'
    require 'logger'
    require 'rubygems'
    require 'nokogiri'
    require 'open-uri'
    require 'zlib'

    USERNAME = "myemail"
    PASSWORD = "mysecret"
    USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"

    mechanize = Mechanize.new do |a|
      a.user_agent = USER_AGENT
    end

    form_page = mechanize.get('https://XXXX.XXXXX.com/signin')
    form = form_page.form_with(:id =>'login')
    form.field_with(:id => 'user_email').value=USERNAME
    form.field_with(:id => 'user_password').value=PASSWORD
    page = form.click_button

    donations = mechanize.get('https://XXXXX.XXXXXX.com/pages/ACCOUNT/statistics')
    puts donations.body

    donations = mechanize.get('https://xxx.siteimscraping.com/pages/myaccount/statistics')
    bs_csv_download = page.link_with(:text => 'Download CSV')

This is the JSON response from the website. It contains the link to the CSV, which I need to parse out and download via Mechanize and/or Nokogiri:

{"message":"Find your report at https://s3.amazonaws.com/reports.XXXXXXX.com/XXXXXXX.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256\u0026X-Amz-Credential=AKIAIKW4BJKQUNOJ6D2A%2F20190228%2Fus-east-1%2Fs3%2Faws4_request\u0026X-Amz-Date=20190228T025844Z\u0026X-Amz-Expires=86400\u0026X-Amz-SignedHeaders=host\u0026X-Amz-Signature=b19b6f1d5120398c850fc03c474889570820d33f5ede5ff3446b7b8ecbaf706e"}

I very much appreciate any help.

Have you tried [Mechanize::Download](https://stackoverflow.com/a/9105153/5399937)? The URL to get could be `JSON.parse(json_response)['message'][/(http.+)/, 1]`, assuming the message always ends with the address. – Tom, Feb 28 '19 at 05:48

mkrl · Answer 1 · 2019-02-28T04:13:27.940

You could parse it as JSON and then retrieve a substring from the response (assuming it always responds in the same format):

require 'json'

...

bs_csv_download = page.link_with(:text => 'Download CSV')
json_response = JSON.parse(bs_csv_download)
direct_link = json_response["message"][20..-1]
mechanize.get(direct_link).save('file.csv')

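The [20..-1] range skips the first 20 characters of the "message" value (the prefix "Find your report at ", which is exactly 20 characters long) and returns everything from index 20 to the end of the string (-1 refers to the last character).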
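A fixed offset breaks as soon as the wording of the message changes, so matching the URL by pattern may be more robust. Here is a minimal sketch of that idea; `extract_report_url` is a hypothetical helper name (not part of Mechanize), and the sample body is a shortened stand-in for the response in the question:

```ruby
require 'json'

# Hypothetical helper: pulls the first http(s) URL out of the "message"
# field of a JSON body shaped like the one in the question.
def extract_report_url(json_body)
  message = JSON.parse(json_body)['message']
  return nil unless message.is_a?(String)

  # \S+ stops at the first whitespace, so any text after the URL is ignored.
  message[%r{https?://\S+}]
end

# Shortened, made-up sample; JSON.parse decodes \u0026 into "&".
sample = '{"message":"Find your report at https://s3.amazonaws.com/reports.example.com/report.csv?X-Amz-Expires=86400\u0026X-Amz-SignedHeaders=host"}'
url = extract_report_url(sample)
puts url
```

The helper returns nil when there is no "message" key or no URL in it, so the caller can check for that before calling mechanize.get.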

edited Feb 28 '19 at 04:13

answered Feb 28 '19 at 03:57

mkrl

It seems close but I'm stuck on an error when trying the above - "`initialize': no implicit conversion of nil into String (TypeError)" – GottaTinker Feb 28 '19 at 04:54
What's the error? It's not too clear what variable in your question contains the JSON response, is it `bs_csv_download`? – mkrl Feb 28 '19 at 04:56
You're getting `nil` as some of your responses. In my example, it is considered that `bs_csv_download` variable will contain the JSON response. It's hard to guess what your responses are because I don't know anything about the structure of the site you're parsing. – mkrl Feb 28 '19 at 05:50
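One possible source of the `nil` is that `link_with` returns nil when no link matches, and the script in the question looks for the link on `page` (the page returned after login) rather than on the statistics page. The sketch below guards against that. It assumes that clicking the link returns the JSON shown in the question, which the thread does not confirm; `download_report` is a hypothetical helper name, and only `link_with`, `click`, `body`, `get` and `save` are Mechanize calls:

```ruby
require 'json'

# Hypothetical helper. `agent` is expected to behave like a Mechanize instance
# and `stats_page` like the Mechanize::Page that actually contains the link.
def download_report(agent, stats_page, path)
  link = stats_page.link_with(text: 'Download CSV')
  # link_with returns nil when nothing matches; fail early with a clear message.
  raise ArgumentError, 'no "Download CSV" link on this page' if link.nil?

  # Assumption: clicking the link returns the JSON body shown in the question.
  message = JSON.parse(link.click.body)['message'].to_s
  url = message[%r{https?://\S+}]
  raise ArgumentError, "no URL in message: #{message.inspect}" if url.nil?

  # Fetch the signed URL and write the response to disk.
  agent.get(url).save(path)
end
```

With the variables from the question, the call would look like `download_report(mechanize, donations, 'file.csv')`.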