
I'm working through the tutorials at http://ruby.bastardsbook.com/chapters/web-crawling/ and would like some clarification on the Handling Redirects one. The DOD website that the author uses as an example has been rebuilt since the time of writing, and I ran into some unexpected results while adjusting his code to work with the current version. (Please note that I don't need help rewriting the code; I'm just wondering why this behaviour occurs.)

Specifically, I get code 301 no matter whether the page I'm trying to get with Net::HTTP.get_response exists or not. For example:

require 'net/http'

VALID = 'https://www.defense.gov/News/Contracts/Contract-View/Article/14038760'
INVALID = 'https://www.defense.gov/News/Contracts/Contract-View/Article/14038759'

resp = Net::HTTP.get_response(URI.parse(VALID))
puts resp.code # 301

resp = Net::HTTP.get_response(URI.parse(INVALID))
puts resp.code # 301

So, why does a valid address return a 301 Moved Permanently? On top of that, actually following the redirect, as suggested in the question "Ruby Net::HTTP - following 301 redirects", gives me a 404, presumably because the redirect link has a trailing slash. (Following it is useless in the scope of the tutorial anyway, since the whole point was to skip anything that isn't a 2xx.)

if resp.code == '301'
  resp = Net::HTTP.get_response(URI.parse(resp.header['location']))
end
puts resp.code # 404

Even more puzzling to me is that when I looked at resp.body I found that despite that 404 error, I had, in fact, successfully downloaded the page's contents.

I would be very grateful if somebody walked me through exactly what is going on here. Thank you in advance for your help and your time.

m.l

1 Answer


This doesn't look like a Ruby issue; it's just how www.defense.gov behaves. https://www.defense.gov/News/Contracts/Contract-View/Article/14038760 returns a redirect (301) and then a 404 regardless of how you request it.

https://www.defense.gov/News/Contracts/Contract-View/Article/14038760 seems to point to a missing article, but https://www.defense.gov/News/Contracts/Contract-View/Article/1403876/ works fine (as of 26.12.2017 03:24 +7). Why do you think the URL with id 14038760 is valid?

I found that https://www.defense.gov/News/Contracts/Contract-View/Article/1403876 redirects to https://www.defense.gov/News/Contracts/Contract-View/Article/1403876/ (the same URL but with a trailing slash), while the URL with the trailing slash returns a 200 response immediately.
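The redirect-then-200 behaviour described above can be handled with a small loop. This is only a minimal sketch, not the tutorial's code: `fetch_following_redirects` and its `limit` and `fetcher` parameters are hypothetical names, and the fetcher is injectable purely so the logic can be tried without touching the network.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not from the tutorial): requests `url` and follows
# 3xx responses until a non-redirect response arrives, making at most
# `limit` requests. Returns the final URI together with its response.
# `fetcher` is an assumption added for testability; by default it calls
# Net::HTTP.get_response.
def fetch_following_redirects(url, limit: 5,
                              fetcher: ->(uri) { Net::HTTP.get_response(uri) })
  uri = URI.parse(url)
  limit.times do
    resp = fetcher.call(uri)
    # Net::HTTPRedirection is the parent class of all 3xx response classes.
    return [uri, resp] unless resp.is_a?(Net::HTTPRedirection)
    # The Location header may be relative, so resolve it against the current URI.
    uri = URI.join(uri.to_s, resp['location'])
  end
  raise "Too many redirects (limit: #{limit})"
end

# Example call (performs real requests, so it is left commented out):
# uri, resp = fetch_following_redirects(
#   'https://www.defense.gov/News/Contracts/Contract-View/Article/1403876')
# puts uri, resp.code
```

Checking `resp.code` only after the loop keeps the tutorial's "skip anything that isn't a 2xx" rule intact, while the trailing-slash redirect is no longer mistaken for a missing page.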

What can you do? First fetch the list of current contracts from https://www.defense.gov/News/Contracts/source/nav/, then request each of them separately.

user3309314
  • Haha, you're right, I have no idea how that 0 got stuck to the end of the link. I must have typed it manually at some point and didn't notice. And, yes, grabbing the actual list of contracts seems like a much more reasonable idea than checking some 1.5 million URLs for validity. I don't know why that's not what the guy does in his tutorial, probably just to show the difference between net/http and nokogiri. Thank you! – m.l Dec 26 '17 at 12:07
  • Oh, I rechecked, and back when he wrote the tutorial contract IDs topped out at just 4653, so it wasn't such an insane thing to do. – m.l Dec 26 '17 at 12:23