I'm working through the tutorials at http://ruby.bastardsbook.com/chapters/web-crawling/ and would like a little clarification on the Handling Redirects chapter. The DOD website the author uses as an example has been remade since the time of writing, and I've run into some unexpected results while adjusting his code to work with the current version. (Please note that I don't need help rewriting the code; I'm just wondering why it behaves the way it does.)
Specifically, I get a 301 no matter whether the page I request with Net::HTTP.get_response exists or not. For example:
require 'net/http'
VALID = 'https://www.defense.gov/News/Contracts/Contract-View/Article/14038760'
INVALID = 'https://www.defense.gov/News/Contracts/Contract-View/Article/14038759'
resp = Net::HTTP.get_response(URI.parse(VALID))
puts resp.code # 301
resp = Net::HTTP.get_response(URI.parse(INVALID))
puts resp.code # 301
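For reference, this is how the redirect target can be read off each response (resp['location'] is just the case-insensitive header lookup, the same data as the resp.header['location'] I use further down):
[VALID, INVALID].each do |url|
  resp = Net::HTTP.get_response(URI.parse(url))
  puts "#{resp.code} -> #{resp['location']}" # status plus redirect target, if any
end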
So, why does a valid address return a 301 Moved Permanently? And not only that: actually following that redirect (useless within the scope of the tutorial, since the whole point was to skip anything that isn't a 2xx), as suggested in Ruby Net::HTTP - following 301 redirects, gives me a 404, presumably because the redirect link has a trailing slash:
if resp.code == '301'
  resp = Net::HTTP.get_response(URI.parse(resp.header['location']))
end
puts resp.code # 404
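To test that trailing-slash guess, the Location value could be requested as-is and again with the slash stripped (just a diagnostic sketch; chomp('/') only removes the slash if it's there):
first = Net::HTTP.get_response(URI.parse(VALID))
loc   = first.header['location']                            # redirect target from the 301
puts Net::HTTP.get_response(URI.parse(loc)).code            # as the server sent it
puts Net::HTTP.get_response(URI.parse(loc.chomp('/'))).code # with any trailing slash removed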
Even more puzzling to me is that when I looked at resp.body, I found that, despite that 404 error, I had in fact successfully downloaded the page's contents.
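(For anyone who wants to reproduce that last observation: saving the body to a file and opening it in a browser shows it; the filename here is arbitrary.)
puts resp.code                              # "404"
File.write('redirect_body.html', resp.body) # despite the 404, the saved file is the page contents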
I would be very grateful if somebody could walk me through what exactly is going on here. Thank you in advance for your time and help.