
I'm trying to download some files from a URL, so I wrote a small Ruby script to fetch them. Here is the script:

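require 'nokogiri'
require 'open-uri'

(1..2).each do |season|
  (1..3).each do |ep|
    # Zero-pad both numbers so the URL reads like s01e03.
    season = season.to_s.rjust(2, '0')
    ep = ep.to_s.rjust(2, '0')

    # URI.open is required on Ruby 3.0+, where open-uri no longer patches Kernel#open.
    page = Nokogiri::HTML(URI.open("https://some-url/s#{season}e#{ep}/releases"))
    page.css('table.table tbody tr td a').each do |el|
      link = el['href']
      # Only fetch links that end in "sujaidr.srt".
      `curl "https://some-url#{link}"` if link&.end_with?('sujaidr.srt')
    end
  end
end
puts "done"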

But the response from curl is:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>Redirecting...</title>
<h1>Redirecting...</h1>
<p>You should be redirected automatically to target URL: 
<a href="/some-url/friends-s0Xe0Y/releases">/some-url/s0Xe0Y/releases</a>.  If not click the link.

When I use wget, the page the redirect points to is downloaded instead of the file. I tried setting the user agent, but that did not work. The server always redirects the link when I try to download the files with curl or other CLI tools such as wget, aria2c, or httpie, and I have not found a solution so far.

How can I do this?


Solved

I decided to use the Watir WebDriver for this. It works well so far.

  • Sounds like a header or cookie is missing. – Stefan Aug 16 '18 at 15:15
  • Wget follows redirects automatically. Are you sure wget is not just following the redirect and then downloading? – Casper Aug 16 '18 at 15:56
  • @Casper, that's the point. Wget follows the redirect and downloads the redirected HTML page, not the file I want. Understand? – Sinésio Neto Aug 16 '18 at 21:57
  • I tried `curl -L --max-redirs 0`, but it returns `curl: (47) Maximum (0) redirects followed`. I know that the [`-L` option tells curl to follow HTTP redirects](https://ec.haxx.se/http-redirects.html). But without it, curl returns the HTML page I quoted in the question. – Sinésio Neto Aug 16 '18 at 22:31
  • If the server responds with a redirect, then the file is not there. You can't download it, if the server doesn't provide it at that address. This is not a problem of not following redirects, this is a problem of the server responding with a redirect instead of what you are expecting. I would debug this with a browser and its network monitor first. If you download it with the browser, is it being redirected too? This should be visible in the browser debugger. – Casper Aug 16 '18 at 23:17
  • Some websites protect their download links with JavaScript in order to make scraping harder. My guess is you might be running into an issue like that (unless it's a cookie issue, like Stefan commented). – Casper Aug 16 '18 at 23:18
  • @Casper, to answer your question "If you download it with the browser, is it being redirected too?": if I copy the file link address from the devtools inspector and paste it into the address bar, it is redirected. But if I click the button to download the file, it is not redirected and the file downloads. – Sinésio Neto Aug 17 '18 at 01:01
  • @Casper, I think you are right about the website protecting its download links with JS. When I inspect the HTML element, the link is behind an `onclick` event. I haven't found a way to work around the problem yet. – Sinésio Neto Aug 17 '18 at 01:08
  • You want to look into using a headless browser setup like PhantomJS. This will run the JavaScript inside a virtual browser that you can control through Ruby. – Casper Aug 17 '18 at 02:08
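
Stefan's comment about a missing header or cookie can be checked without a browser. The sketch below is only an illustration under assumptions: `download_with_session` is a hypothetical helper name, both URLs are assumed to live on the same host, and the server is assumed to accept a plain `Cookie` plus `Referer` pair. If the link is really built by JavaScript in an `onclick` handler, as it turned out in this thread, this will likely not be enough.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: visit the listing page first, then request the file
# with the cookies the server set and a Referer pointing back at that page.
# Assumes page_url and file_url are on the same host.
def download_with_session(page_url, file_url)
  page_uri = URI(page_url)
  file_uri = URI(file_url)

  Net::HTTP.start(page_uri.host, page_uri.port, use_ssl: page_uri.scheme == 'https') do |http|
    page = http.get(page_uri.request_uri)

    # Keep only the name=value part of every Set-Cookie header.
    set_cookies = page.get_fields('set-cookie') || []
    cookies = set_cookies.map { |c| c.split(';').first }.join('; ')

    headers = { 'Referer' => page_url }
    headers['Cookie'] = cookies unless cookies.empty?

    # The block's return value (the file response) is returned by Net::HTTP.start.
    http.get(file_uri.request_uri, headers)
  end
end
```

If the response is still a redirect (`Net::HTTPRedirection`), the server is probably checking something that only the page's JavaScript supplies, which is where a real or headless browser comes in.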

1 Answer


If you want to download the file rather than the page that performs the redirect, try the -L option in your code, for example:

curl -L "https://some-url#{link}"

From the curl man page:

-L, --location
              (HTTP) If the server reports that the requested page has moved to a different
              location  (indicated  with  a  Location:  header  and  a  3XX
              response  code),  this  option will make curl redo the request on
              the new place.

If you are using Ruby, instead of calling curl or other third-party tools, you may want to use something like this:

require 'net/http'
# Pass the bare host name ("somedomain.net"), not "somedomain.net/";
# a trailing slash makes Net::HTTP.start raise an exception.
Net::HTTP.start("somedomain.net") do |http|
    resp = http.get("/flv/sample/sample.flv")
    # Write in binary mode so the bytes are saved unchanged.
    File.open("sample.flv", "wb") do |file|
        file.write(resp.body)
    end
end
puts "Done."

The example comes from this answer: https://stackoverflow.com/a/2263547/1135424
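
Note that `Net::HTTP#get` does not follow redirects on its own, so the snippet above would save the "Redirecting..." page just as plain curl does. A minimal sketch of the `curl -L` behaviour in Ruby might look like the following; `fetch_following_redirects` is a hypothetical helper name, and the limit of 5 hops is an arbitrary assumption. It only helps when the redirect target is the file itself, which, judging by the comments, may not be the case for this site.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: fetch a URL and follow up to `limit` redirects,
# roughly what `curl -L` does. Returns the final Net::HTTPResponse.
def fetch_following_redirects(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.zero?

  response = Net::HTTP.get_response(URI(url))

  case response
  when Net::HTTPRedirection
    # Location may be relative (as in the question's output),
    # so resolve it against the URL that was just requested.
    next_url = URI.join(url, response['location']).to_s
    fetch_following_redirects(next_url, limit - 1)
  else
    response
  end
end

# Example use (the URL and file name are placeholders):
#   response = fetch_following_redirects("https://some-url#{link}")
#   File.binwrite('subtitle.srt', response.body) if response.is_a?(Net::HTTPSuccess)
```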

nbari