When I visit

https://www.duckduckgo.com/?q=!ducky+goodreads+quotes+A+Promised+Land+Barack+Obama&t=h_&ia=web

It ultimately redirects to

https://www.goodreads.com/work/quotes/86336100-a-promised-land

Using Ruby, is there a way to pass in the first duckduckgo page, but collect the final url that this redirects to?

I tried using

require 'net/http'

res = Net::HTTP.get_response(URI('https://www.duckduckgo.com/?q=!ducky+goodreads+quotes+A+Promised+Land+Barack+Obama&t=h_&ia=web'))
puts res['location']

but this only outputs the same duckduckgo link.

sharataka
  • It's not the same duckduckgo link. The `location` header in the redirect uses the `duckduckgo.com` hostname rather than `www.duckduckgo.com`. When requesting this returned URL, the response contains some HTML/JavaScript which performs the final redirect. – Holger Just Dec 12 '22 at 15:57
  • There are two additional redirects and each page contains a `<meta http-equiv="refresh">` element as well as a script tag. You could use Nokogiri to parse the page and get the URL. But be warned that it's going to be a very brittle solution. – max Dec 12 '22 at 16:42
  • The [REFRESH header](https://codingshower.com/http-refresh-header-meta-refresh/) is an old non-standard header dating back to the days of Netscape which may cause the browser to redirect. Or not. – max Dec 12 '22 at 16:44
  • @max do you have advice on how that solution could/would work? – sharataka Dec 12 '22 at 18:42
  • @HolgerJust thanks for the information, but I'm not sure what I should specifically add in Ruby to help me get a solution. do you have thoughts? thank you in advance! – sharataka Dec 12 '22 at 18:43
  • I'm not going to do your job for you. There are plenty of tutorials on how to parse HTML with Nokogiri as well as several questions which answer how to follow the initial redirect with Net::HTTP. https://stackoverflow.com/questions/6934185/ruby-net-http-following-redirects – max Dec 12 '22 at 20:25
  • hi @max, I didn't mean to ask you to do all the work :). when I tried the upvoted answer and replaced the link above I'm trying to get to work, I still keep getting an error (400 "Bad Request")...so I'm at a bit of a loss on how to proceed. I'm not sure if the duckduckgo url above is actually redirecting or doing something else to get to the final result, maybe that's the issue? – sharataka Dec 15 '22 at 14:51

1 Answer

TL;DR

DuckDuckGo sends you through multiple redirects, and some of them happen on the client side through JavaScript rather than through an HTTP redirect. You'll need to either follow all of these redirects manually with Net::HTTP and pull the URLs out of the returned HTML/JavaScript, or use a different tool like Selenium Ruby or Capybara, which can execute the JavaScript.

Using Ruby, is there a way to pass in the first duckduckgo page, but collect the final url that this redirects to?

In short, it would be quite difficult. You'd have to write quite a bit of custom code, and there are much better tools for this.

What the browser is doing (the full story)

Here's the full story of what DuckDuckGo is doing with your requests:

Request #1

URL: https://www.duckduckgo.com/?q=!ducky+goodreads+quotes+A+Promised+Land+Barack+Obama&t=h_&ia=web. This returns a 301 redirect to: https://duckduckgo.com/?q=!ducky+goodreads+quotes+A+Promised+Land+Barack+Obama&t=h_&ia=web. Note that the 'www' is not in the returned URL.

Request #2

So we make a request to the new URL without the 'www':

redirected_url = res['location']
res = Net::HTTP.get_response(URI(redirected_url))

If we were in the browser, we would be directed again towards the goodreads site. However, if we inspect the response (i.e. res), DuckDuckGo is not sending a true 301 redirect this time. The redirect is done on the client side instead. Here's the output of res.body:

<html><head><meta http-equiv='Content-Type' content='text/html; charset=utf-8'><meta name='referrer' content='origin'><meta name='robots' content='noindex, nofollow'><meta http-equiv='refresh' content='0; url=/l/?uddg=https%3A%2F%2Fwww.goodreads.com%2Fwork%2Fquotes%2F86336100%2Da%2Dpromised%2Dland&rut=c5d81c30df243e291a04d906995e775c2f7c1bec359e8efa2ac0451aa701a8bf'></head><body><script language='JavaScript'>function ffredirect(){window.location.replace('/l/?uddg=https%3A%2F%2Fwww.goodreads.com%2Fwork%2Fquotes%2F86336100%2Da%2Dpromised%2Dland&rut=c5d81c30df243e291a04d906995e775c2f7c1bec359e8efa2ac0451aa701a8bf');}setTimeout('ffredirect()',100);</script></body></html>

If you scroll through the text above, you'll notice a <meta http-equiv='refresh'> tag and a <script> tag with a window.location.replace(...) call. Both point at the same relative URL, and the browser uses them to redirect to it.
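If you do want to stay with plain Ruby, here is a minimal sketch of pulling that relative URL out of the body with regular expressions. The helper name extract_client_redirect and both patterns are my own assumptions, written only against the markup shown above (the sample string is that body with the rut parameter trimmed). As noted in the comments, this is brittle and will break if DuckDuckGo changes the page.

```ruby
# Hypothetical helper (not part of Net::HTTP): find the client-side redirect
# target in an HTML body. The patterns assume markup shaped like the
# DuckDuckGo response shown above; they are not a general HTML parser.
META_REFRESH = /<meta\s+http-equiv=['"]refresh['"]\s+content=['"]\s*\d+\s*;\s*url=([^'"]+)['"]/i
JS_REPLACE   = /window\.location\.replace\(\s*['"]([^'"]+)['"]\s*\)/

def extract_client_redirect(html)
  # Prefer the meta refresh tag, fall back to the JavaScript call.
  match = html.match(META_REFRESH) || html.match(JS_REPLACE)
  match && match[1]
end

# Trimmed copy of the body above (assumption: rut parameter removed for brevity).
sample = "<html><head><meta http-equiv='refresh' content='0; " \
         "url=/l/?uddg=https%3A%2F%2Fwww.goodreads.com%2Fwork%2Fquotes%2F86336100%2Da%2Dpromised%2Dland'>" \
         "</head><body><script language='JavaScript'>function ffredirect(){" \
         "window.location.replace('/l/?uddg=https%3A%2F%2Fwww.goodreads.com%2Fwork%2Fquotes%2F86336100%2Da%2Dpromised%2Dland');}" \
         "setTimeout('ffredirect()',100);</script></body></html>"

puts extract_client_redirect(sample)
```

The returned value is relative, so it still has to be resolved against https://duckduckgo.com before it can be requested.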

Request #3

Now, our browsers will follow the URL that's within that window.location.replace JavaScript call. However, that's tough to do with Ruby, because Net::HTTP only follows what you explicitly request and never executes JavaScript. It's possible DuckDuckGo implemented this to discourage scraping or to control what referrer information is passed along, but that's a guess on my part. Either way, you'd have to parse the target out of the page yourself.

The result is that we are sent to a new page, something like: https://duckduckgo.com/l/?uddg=https%3A%2F%2Fwww.goodreads.com%2Fwork%2Fquotes%2F86336100%2Da%2Dpromised%2Dland&rut=c5d81c30df243e291a04d906995e775c2f7c1bec359e8efa2ac0451aa701a8bf
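Notice that the destination is already sitting in the uddg query parameter of that URL, percent-encoded. As a hedged shortcut, and assuming DuckDuckGo keeps using the uddg parameter this way (I can't promise that it will), you can decode it with the standard library instead of making further requests. The method name uddg_target is hypothetical.

```ruby
require 'uri'

# Hypothetical helper: read the percent-encoded destination from the
# "uddg" query parameter of a DuckDuckGo /l/ link.
# Returns nil when the link has no query string or no uddg parameter.
def uddg_target(link)
  query = URI.parse(link).query
  return nil unless query

  # decode_www_form splits the query into pairs and percent-decodes them.
  URI.decode_www_form(query).to_h['uddg']
end

link = 'https://duckduckgo.com/l/?uddg=https%3A%2F%2Fwww.goodreads.com%2Fwork%2Fquotes%2F86336100%2Da%2Dpromised%2Dland' \
       '&rut=c5d81c30df243e291a04d906995e775c2f7c1bec359e8efa2ac0451aa701a8bf'

puts uddg_target(link)
# => https://www.goodreads.com/work/quotes/86336100-a-promised-land
```

This also works on the relative form (/l/?uddg=...), since only the query string is inspected.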

Request #4

Finally, this last page is the one that redirects us to GoodReads, again through Javascript.
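For completeness, here is a rough sketch of what "following all these redirects manually" could look like: a loop that follows both HTTP Location headers and the client-side redirects described above. Everything here (final_url, next_hop, the CLIENT_REDIRECT pattern, the MAX_HOPS limit, the fetch parameter) is my own assumption rather than anything Net::HTTP provides. I have not verified it against the live site, which may answer scripted requests differently than it answers a browser, and the pattern could also misfire on a destination page that happens to contain a window.location.replace call of its own.

```ruby
require 'net/http'
require 'uri'

MAX_HOPS = 10 # arbitrary safety limit (assumption)

# Matches either a meta refresh URL or a window.location.replace('...') target.
CLIENT_REDIRECT = /http-equiv=['"]refresh['"][^>]*url=([^'"]+)['"]|window\.location\.replace\(\s*['"]([^'"]+)['"]/i

# Returns the next absolute URL for a response, or nil if there is no redirect.
def next_hop(current, response)
  target =
    if response.is_a?(Net::HTTPRedirection)
      response['location']                       # ordinary HTTP redirect
    elsif (m = response.body.to_s.match(CLIENT_REDIRECT))
      m[1] || m[2]                               # client-side redirect
    end
  target && URI.join(current, target).to_s       # resolve relative targets
end

# fetch is injectable so the loop can be exercised without network access.
def final_url(start, fetch: ->(url) { Net::HTTP.get_response(URI(url)) })
  url = start
  MAX_HOPS.times do
    following = next_hop(url, fetch.call(url))
    return url if following.nil?

    url = following
  end
  raise "gave up after #{MAX_HOPS} hops"
end
```

Calling final_url with the original DuckDuckGo URL would then make the real requests. If that turns out to be unreliable, a browser-driving tool remains the more robust route.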

So, check out Selenium Ruby or Capybara. Both have excellent documentation, and they should be able to support what you're trying to do. Good luck!

Matt