
I have a URL that I can't access with Mechanize, and I don't know why:

# Use ruby 2.1.6
require 'mechanize'
require 'axlsx' # 2.0.1
require 'roo' # 1.13.2

mechanize = Mechanize.new
mechanize.request_headers = { "Accept-Encoding" => "" }
mechanize.ignore_bad_chunking = true
mechanize.follow_meta_refresh = true

xlsx = Roo::Excelx.new("./base_list.xlsx")

xlsx.each_with_pagename do |name, sheet|
  sheet.each do |row|
    page = mechanize.get(row[0]) # first column of each row holds the URL
  end
end

When I iterate over my list I get URLs like https://angel.co/_helencousins. I can access it with my browser but not with Mechanize, which raises this error:

/.rvm/gems/ruby-2.1.6/gems/mechanize-2.7.4/lib/mechanize/http/agent.rb:316:in `fetch': 404 => Net::HTTPNotFound for https://angel.co/_helencousins -- unhandled response (Mechanize::ResponseCodeError)
    from /Users/xxx/.rvm/gems/ruby-2.1.6/gems/mechanize-2.7.4/lib/mechanize.rb:464:in `get'
    from scraper.rb:15:in `block (2 levels) in <main>'
    from /Users/xxx/.rvm/gems/ruby-2.1.6/gems/roo-1.13.2/lib/roo/base.rb:428:in `block in each'
    from /Users/xxx/.rvm/gems/ruby-2.1.6/gems/roo-1.13.2/lib/roo/base.rb:427:in `upto'
    from /Users/xxx/.rvm/gems/ruby-2.1.6/gems/roo-1.13.2/lib/roo/base.rb:427:in `each'
    from scraper.rb:14:in `block in <main>'
    from /Users/xxx/.rvm/gems/ruby-2.1.6/gems/roo-1.13.2/lib/roo/base.rb:398:in `block in each_with_pagename'
    from /Users/xxx/.rvm/gems/ruby-2.1.6/gems/roo-1.13.2/lib/roo/base.rb:397:in `each'
    from /Users/xxx/.rvm/gems/ruby-2.1.6/gems/roo-1.13.2/lib/roo/base.rb:397:in `each_with_pagename'
    from scraper.rb:13:in `<main>'
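As an aside, a single Mechanize::ResponseCodeError like this aborts the whole loop. Independent of why this URL 404s, the usual pattern is to rescue per row so one bad URL doesn't stop the rest. A minimal sketch of that pattern (fetch_urls is a hypothetical helper, shown here with a simulated failure rather than a real Mechanize call):

```ruby
# Fetch each URL via the given block, skipping any that raise,
# and return a hash of url => result for the ones that succeeded.
def fetch_urls(urls)
  pages = {}
  urls.each do |url|
    begin
      pages[url] = yield(url)
    rescue StandardError => e
      warn "skipping #{url}: #{e.class}: #{e.message}"
    end
  end
  pages
end

# Simulated run: the "bad" URL raises, the good one is kept.
demo = fetch_urls(%w[http://good.example http://bad.example]) do |url|
  raise 'simulated 404' if url.include?('bad')
  "page for #{url}"
end
# demo contains only the entry for http://good.example
```

In the scraper above the block body would be `mechanize.get(row[0])`, rescuing Mechanize::ResponseCodeError specifically.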
Ismael Bourg
  • Possible problem: http://stackoverflow.com/questions/8567973/why-does-accessing-a-ssl-site-with-mechanize-on-windows-fail-but-on-mac-work – 7stud Jan 12 '16 at 16:55
  • The odds are very good they're sniffing the user-agent information for the request and refusing anything that isn't a standard browser. First, you should look to see if they have an API, and, if so, use it. If they don't, look for their TOS and see if they allow mechanized scraping. If so, try changing your user-agent string, or contact their site admin and ask for help. – the Tin Man Jan 12 '16 at 19:11

1 Answer


OK, the problem was that the website blocks the default Mechanize user agent.

I just changed it to a browser alias:

mechanize.user_agent_alias = 'Windows Chrome'
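The same idea works with plain Net::HTTP: sites that refuse library defaults usually only look at the User-Agent header. A minimal sketch that builds (but does not send) a request with a browser-like agent; the UA string below is an example value, not the exact string Mechanize uses for 'Windows Chrome':

```ruby
require 'net/http'
require 'uri'

# Example browser-like User-Agent string (an assumption for illustration,
# not the literal value Mechanize's 'Windows Chrome' alias sends).
BROWSER_UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' \
             '(KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36'

# Build a GET request carrying the browser-like User-Agent header
# instead of Ruby's default "Ruby" agent.
def browser_request(url)
  uri = URI(url)
  req = Net::HTTP::Get.new(uri)
  req['User-Agent'] = BROWSER_UA
  req
end

req = browser_request('https://angel.co/_helencousins')
# req['User-Agent'] now carries the browser string
```

With Mechanize itself, `user_agent_alias=` does this in one line by picking a known string from its built-in alias table.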

Ismael Bourg