
There's an FLV file on the web that can be downloaded directly in Chrome. The file is a television program published by CCTV (China Central Television). CCTV is a non-profit, state-owned broadcaster, financed by the Chinese taxpayer, which allows us to download its content without infringing copyright.

Using wget, I can download the file from a different address, but not from the address that works in Chrome.

This is what I've tried to do:

url='http://114.80.235.200/f4v/94/163005294.h264_1.f4v?10000&key=7b9b1155dc632cbab92027511adcb300401443020d&playtype=1&tk=163659644989925531390490125&brt=2&bc=0&nt=0&du=1496650&ispid=23&rc=200&inf=1&si=11000&npc=1606&pp=0&ul=2&mt=-1&sid=10000&au=0&pc=0&cip=222.73.44.31&hf=0&id=tudou&itemid=135558267&fi=163005294&sz=59138302'  

wget -c "$url" --user-agent="" -O xfgs.f4v

This doesn't work either:

wget -c "$url" -O xfgs.f4v

The output is:

Connecting to 118.26.57.12:80... connected.  
HTTP request sent, awaiting response... 403 Forbidden  
2013-02-13 09:50:42 ERROR 403: Forbidden.  

What am I doing wrong?

I ultimately want to download it with the Python library mechanize. Here is the code I'm using for that:

import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)  # ignore robots.txt
br.set_handle_equiv(False)   # ignore HTTP-EQUIV meta headers
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
url = 'http://114.80.235.200/f4v/94/163005294.h264_1.f4v?10000&key=7b9b1155dc632cbab92027511adcb300401443020d&playtype=1&tk=163659644989925531390490125&brt=2&bc=0&nt=0&du=1496650&ispid=23&rc=200&inf=1&si=11000&npc=1606&pp=0&ul=2&mt=-1&sid=10000&au=0&pc=0&cip=222.73.44.31&hf=0&id=tudou&itemid=135558267&fi=163005294&sz=59138302'
r = br.open(url).read()
tofile = open("/tmp/xfgs.f4v", "wb")  # binary mode for video data
tofile.write(r)
tofile.close()

This is the result:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
    return self._mech_open(url, data, timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open
    raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 403: Forbidden

Can anyone explain how to get the mechanize code to work please?

showkey
  • What happens if you use: `user_agent='Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'` and then `wget -c "${url}" --user-agent="${user_agent}" -O xfgs.f4v`? If that doesn't work, then Python may not be able to help you! – johnsyweb Feb 13 '13 at 02:45
  • The reason you get a 403 response is most likely because the website keeps state when you visit it in a browser, most likely via a cookie. That's what YouTube does. Export the cookie from your browser and set it in wget (you can simply use the "Cookie:" header) and it should work. – Attila O. Feb 13 '13 at 02:47
  • @Johnsyweb Python can perfectly emulate a browser in most cases (well, except odd sites that set a cookie via JavaScript and such). – Attila O. Feb 13 '13 at 02:49
  • @AttilaO. Right, and there is no reason your Python can't port the functionality of the JavaScript either. They are nearly identical languages syntactically (apart from Python's significant whitespace and different libraries). – G. Shearer Feb 22 '13 at 17:19
  • And be wary of driving a browser except for the most obscure of scraping problems. For fairly simple ones like this, it is better to have a pure-Python solution, so you can run it anywhere and it is fast as all heck. Using Selenium or something similar will suddenly require you to have MORE languages and/or dependent applications installed on the box that runs your script. Don't overcomplicate the issue if you don't need to. – G. Shearer Feb 22 '13 at 17:21

11 Answers


First of all, if you are attempting any kind of scraping (yes, this counts as scraping even though you are not necessarily parsing HTML), you have a certain amount of preliminary investigation to perform.

If you don't already have Firefox and Firebug, get them. Then if you don't already have Chrome, get it.

Start up Firefox/Firebug, and Chrome, clear out all of your cookies/etc. Then open up Firebug, and in Chrome open up View->Developer->Developer Tools.

Then load up the main page of the video you are trying to grab. Take notice of any cookies/headers/POST variables/query string variables that are being set when the page loads. You may want to save this info somewhere.

Then try to download the video; once again, take notice of any cookies/headers/POST variables/query string variables that are set when the video is loaded. It is very likely that a cookie or POST variable was set when you initially loaded the page that is required to actually pull the video file.

When you write your Python, you are going to need to emulate this interaction as closely as possible. Use python-requests. It is probably the simplest URL library available, and unless you run into a wall with it (something it can't do), I would never use anything else. The second I started using python-requests, all of my URL-fetching code shrank by a factor of five.

Now, things are probably not going to work the first time you try them. So you will need to load the main page using Python. Print out all of your cookies/headers/POST variables/query string variables and compare them to what Chrome/Firebug had. Then try loading your video and, once again, compare all of these values (that means what YOU sent the server, and what the SERVER sent back to you as well). You will need to figure out what is different between them (don't worry, we ALL learned this one in kindergarten... "one of these things is not like the other") and dissect how that difference is breaking things.

If at the end of all of this, you still can't figure it out, then you probably need to look at the HTML for the page that contains the link to the movie. Look for any javascript in the page. Then use Firebug/Chrome Developer Tools to inspect the javascript and see if it is doing some kind of management of your user session. If it is somehow generating tokens (cookies or POST/GET variables) related to video access, you will need to emulate its tokenizing method in python.

Hopefully all of this helps, and doesn't look too scary. The key is you are going to need to be a scientist. Figure out what you know, what you don't, what you want, and start experimenting and recording your results. Eventually a pattern will emerge.

Edit: Clarify steps

  1. Investigate how state is being maintained
  2. Pull initial page with python, grab any state info you need from it
  3. Perform any tokenizing that may be required with that state info
  4. Pull the video using the tokens from steps 2 and 3
  5. If stuff blows up, output your request/response headers,cookies,query vars, post vars, and compare them to Chrome/Firebug
  6. Return to step 1 until you find a solution
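
For illustration, a rough sketch of steps 2 through 5 with python-requests; the page URL and the regex are hypothetical placeholders that your investigation from step 1 would replace:

import re
import requests

# Step 2: pull the page that embeds the video. The Session object keeps
# any cookies the server sets, which is the "state" we need to carry over.
session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Firefox/3.0.1'
page = session.get('http://www.example.com/page-with-video')  # placeholder URL

# Step 3: pull whatever token or link the page hands out. This regex is a
# placeholder; inspect the real HTML to see where the video URL lives.
match = re.search(r'http://[^"\']+\.f4v\?[^"\']+', page.text)
video_url = match.group(0)

# Step 4: fetch the video inside the same session, streaming it to disk.
video = session.get(video_url, stream=True)
video.raise_for_status()  # if this raises 403, go to step 5 and compare headers
with open('xfgs.f4v', 'wb') as out:
    for chunk in video.iter_content(8192):
        out.write(chunk)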

Edit: You may also be getting redirected at either one of these requests (the HTML page or the file download). You will most likely miss the request/response in Firebug/Chrome if that is happening. The solution is to use a sniffer like LiveHTTPHeaders, or, as other responders have suggested, WireShark or Fiddler. Note that Fiddler will do you no good on a Linux or OS X box; it is Windows-only and definitely focused on .NET development... (ugh). Wireshark is very useful but overkill for most problems, and depending on what machine you are running, you may have trouble getting it working. So I would suggest LiveHTTPHeaders first.

I love this kind of problem

G. Shearer
  • Honestly, I like answers like this; it's the old "don't give me a fish, teach me how to fish". And these are good hints. – philippe lhardy Feb 22 '13 at 20:55
  • Well, this is programming, right? :) Thanks; the best way to learn is by doing (debugging). Break things and find out why they broke. Then you understand the entire system better, rather than just getting by with one problem-domain-specific solution. – G. Shearer Feb 24 '13 at 18:51
  • To automate web stuff, I often use these three tools: (1) Chrome's Developer Tools with the 'Preserve Log' option enabled for Network and Console; (2) once the POST/GET requests are identified (you need some skill for that), 'Copy as cURL'; (3) then write my own Python to automate the case. I also found this tool, http://curl.trillworks.com/#python, which generates the code for me. You MAY also need Python/urllib2's cookiejar to persist cookies. You can also use nodejs or phantomjs to execute/eval JavaScript code easily inside your own code. – hzrari Apr 16 '15 at 21:58

It seems that mechanize can do stateful browsing, meaning that it will keep context and cookies between browser requests. I would suggest first loading the complete page where the video is located, then making a second request to download the video explicitly. That way, the web server will think it is a full (legitimate) browsing session.
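
A minimal sketch of that idea, assuming the video is reachable from a known embedding page (the page URL below is a placeholder):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US) Firefox/3.0.1')]

# First request: load the page that embeds the video, so the server
# can set whatever cookies it associates with a real browsing session.
br.open('http://www.example.com/page-with-video')  # placeholder URL

# Second request: the video itself; mechanize automatically sends
# the cookies collected above.
video_url = 'http://114.80.235.200/f4v/94/163005294.h264_1.f4v?...'  # the long URL from the question
data = br.open(video_url).read()
with open('/tmp/xfgs.f4v', 'wb') as out:
    out.write(data)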

Eric
  1. You can use Selenium or Watir to do all the stuff you need in a browser.
  2. Since you don't want to see the browser, you can run Selenium headless (see the sketch below).

see also this answer.
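
One way to do this headless is the PhantomJS driver, which was a common way to run Selenium without a display at the time; the page URL and link text here are placeholders:

from selenium import webdriver

# PhantomJS is a headless WebKit browser, so nothing is rendered on screen,
# but the page's JavaScript still runs and any client-side tokens get set.
driver = webdriver.PhantomJS()
driver.get('http://www.example.com/page-with-video')  # placeholder URL

# Grab the generated video link and the session cookies; both can be
# handed off to wget/requests for the actual download if you prefer.
link = driver.find_element_by_partial_link_text('Download')  # placeholder link text
video_url = link.get_attribute('href')
cookies = driver.get_cookies()
driver.quit()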

dnozay

Assuming that you did not type the URL out of the blue by hand, use mechanize to first go to the page where you got it from. Then emulate the action you take to download the actual file (probably clicking a link or a button).

This might not work though, as mechanize keeps state of cookies and redirects but does not handle any JavaScript real-time changes to the HTML pages. To check whether JavaScript is crucial for the operation, switch off JavaScript in Chrome (or any other browser) and make sure you can download the file. If JavaScript is necessary, I would try to programmatically drive a browser to get the file.

My usual approach to this kind of scraping is:

  1. try wget or Python's urllib2
  2. try mechanize
  3. drive a browser

Unless there is some captcha, the last one usually works, but the others are easier (and faster).
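
Step 2 of that list might look like the following with mechanize, assuming the download is a plain link on the page (the URL and the link pattern are placeholders):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.open('http://www.example.com/page-with-video')  # placeholder URL

# Emulate clicking the download link rather than typing its URL by hand;
# mechanize carries cookies and follows redirects along the way.
response = br.follow_link(text_regex=r'[Dd]ownload')  # placeholder pattern
with open('xfgs.f4v', 'wb') as out:
    out.write(response.read())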

Anthon

In order to clarify the "why" part of your question, you can route your browser's and your code's requests through a debug proxy. If you are using Windows, I suggest Fiddler2. There are debug proxies for other platforms as well, but Fiddler2 is definitely my favourite.

http://www.fiddler2.com/fiddler2/

https://www.owasp.org/index.php/Category:OWASP_WebScarab_Project

http://www.charlesproxy.com/

Or, at a lower level: http://netcat.sourceforge.net/

http://www.wireshark.org/

Once you know the differences, it is usually much simpler to come up with a solution. I suspect that the other answers regarding stateful browsing/cookies are correct. With the tools mentioned you can analyze these cookies and roll a suitable solution without going for browser automation.
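
To make your script's traffic show up in the same proxy for a side-by-side comparison with the browser, most HTTP libraries can be pointed at it; for example with requests, assuming the proxy listens on Fiddler's default 127.0.0.1:8888:

import requests

# Route the script's requests through the local debug proxy so they can
# be compared against the browser's requests in the proxy's UI.
proxies = {'http': 'http://127.0.0.1:8888'}
response = requests.get('http://www.example.com/video.f4v', proxies=proxies)  # placeholder URL
print response.status_code, response.headers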

Udo Klein

I think many sites use temporary links that only exist within your session. The code in the URL is probably something like your session ID. That means the particular link will never work again.

You'll have to reopen the page that contains the link using some library that accommodates this session (as mentioned in other answers), then try to locate the link and use it only within that session.

Rembunator

There's an open-source Python library named Ghost that wraps a headless WebKit browser, so you can control everything through a simple API:

from ghost import Ghost
ghost = Ghost()

page, resources = ghost.open('http://my.web.page')

It supports cookies, JavaScript and everything else. You can inject JavaScript into the page, and although it's headless and doesn't render anything graphically, you still have the DOM. It's a complete browser.

It wouldn't scale well, but it's lots of fun, and may be useful when you need something approaching a complete browser.

Carl Smith
  • This is probably overkill for his problem, since it's loading up a full WebKit/JavaScriptCore or V8 instance. That makes it non-pure-Python, and dependencies may become a problem depending on where his code runs. But regardless, awesome suggestion! I had never seen this before. – G. Shearer Feb 22 '13 at 17:09

While the currently accepted answer (by G. Shearer) is the best possible advice for scraping in general, I've found a way to skip a few steps: a Firefox extension called cliget that takes the request context with all the HTTP headers and cookies and generates a curl (or wget) command that is copied to the clipboard.

EDIT: this feature is also available in the network panels of Firebug and the Chrome debugger: right-click the request, "copy as curl".

Most of the time you'll get a very verbose command with a few apparently unneeded headers, but you can remove those one by one until the server rejects the request, instead of the opposite (which, honestly, I find frustrating: I often got stuck wondering which header was missing from the request).

(Also, you might want to remove the -O option from the curl command line to see the result on stdout instead of downloading it to a file, and add -v to see the full header list.)

Even if you don't want to use curl/wget, converting a curl/wget command line to Python code is just a matter of knowing how to add headers to a urllib request (or to any HTTP request library, for that matter).
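
As a sketch of that last point, each -H flag from a copied curl command becomes one add_header call in urllib2 (all the header values below are placeholders for whatever cliget or "copy as curl" gave you):

import urllib2

video_url = 'http://114.80.235.200/f4v/94/163005294.h264_1.f4v?...'  # the URL from the question
req = urllib2.Request(video_url)
# Each add_header mirrors one -H flag from the copied curl command.
req.add_header('User-Agent', 'Mozilla/5.0 ...')       # placeholder value
req.add_header('Referer', 'http://www.example.com/')  # placeholder value
req.add_header('Cookie', 'key=value')                 # placeholder value
response = urllib2.urlopen(req)
with open('xfgs.f4v', 'wb') as out:
    out.write(response.read())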

dequis
from urllib import urlopen
print urlopen(url).read()  # Python's built-in high-level interface for fetching online resources; url must be defined first
noɥʇʎԀʎzɐɹƆ

Did you try the requests module? It's much simpler to use than urllib2, pycurl, etc., yet it's powerful. It has the following features (the link is here):

  • International Domains and URLs
  • Keep-Alive & Connection Pooling
  • Sessions with Cookie Persistence
  • Browser-style SSL Verification
  • Basic/Digest Authentication
  • Elegant Key/Value Cookies
  • Automatic Decompression
  • Unicode Response Bodies
  • Multipart File Uploads
  • Connection Timeouts
  • .netrc support
  • Python 2.6—3.3
  • Thread-safe.
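
Of these, "Sessions with Cookie Persistence" is the one that matters most for this question; a minimal sketch (the URLs are placeholders):

import requests

s = requests.Session()
s.get('http://www.example.com/page-with-video')  # server sets its cookies here
r = s.get('http://www.example.com/video.f4v')    # same cookies sent back automatically
print r.status_code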
You could use Internet Download Manager; it is able to capture and download any streaming media from any website.

Hannibal NH