
I am trying to understand how this website works. There is an input form where you can provide a URL; the form returns information retrieved from another site (YouTube). So:

  1. My first and most interesting question is whether anybody has any idea how this site retrieves the entire corpus of comments.

  2. Alternatively, for now I am using the following code:

    from BeautifulSoup import BeautifulSoup
    import json
    import urllib2

    # videoId and npage are set elsewhere in my script
    urlstr = 'http://www.sandracires.com/en/client/youtube/comments.php?v=' + videoId + '&page=' + str(npage)
    url = urllib2.urlopen(urlstr)
    content = url.read()
    soup = BeautifulSoup(content)

    # parse the JSON response
    newDictionary = json.loads(str(soup))

    # print an example field
    print newDictionary['list'][1]['username']
    

    However, I cannot iterate over all the pages (which is not the case when I do it manually). I have placed time.sleep(30) after the JSON parsing, but without success. Why is that happening?

Thanks!

Python 2.7.8


1 Answer

  1. Probably by using the Google YouTube Data API. Note that (presently) comments can only be retrieved using version 2 of the API, which has been deprecated; apparently there is no support yet in v3. Python client libraries are available, see https://developers.google.com/youtube/code#Python. A rough sketch of querying that feed directly appears at the end of this answer.

  2. The response is already JSON, so there is no need for BeautifulSoup. The web server seems to require cookies, so I recommend the requests module, in particular its session management:

    import requests

    videoId = 'ZSzeFFsKEt4'
    results = []
    npage = 1
    # a session persists cookies between requests, which this server requires
    session = requests.session()
    while True:
        urlstr = 'http://www.sandracires.com/en/client/youtube/comments.php'
        print "Getting page ", npage
        response = session.get(urlstr, params={'v': videoId, 'page': npage})
        content = response.json()
        # a near-empty 'list' means there are no more pages
        if len(content['list']) > 1:
            results.append(content)
        else:
            break
        npage += 1

    print results
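
As a usage note, here is a minimal post-processing sketch that flattens the collected pages into one list of usernames; it assumes the pages keep the same `list`/`username` fields shown in the question, which I have not verified beyond the first page:

    # Sketch: flatten the collected pages into one list of usernames.
    # The 'list' and 'username' field names are taken from the question.
    usernames = []
    for page in results:
        for comment in page['list']:
            usernames.append(comment.get('username'))

    print "collected %d comments" % len(usernames)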
    
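Regarding point 1, here is a rough sketch of what pulling comments directly from the deprecated v2 gdata feed might look like. Treat the URL shape, the `alt=json`/`orderby=published` parameters and the `$t` field layout as assumptions based on the old gdata conventions, not verified code:

    import json
    import urllib2

    # Sketch only: query the (deprecated) v2 gdata comments feed.
    # URL and field names follow the old gdata JSON conventions.
    videoId = 'ZSzeFFsKEt4'
    url = ('https://gdata.youtube.com/feeds/api/videos/' + videoId +
           '/comments?alt=json&orderby=published&max-results=25')
    feed = json.load(urllib2.urlopen(url))

    for entry in feed['feed']['entry']:
        print entry['content']['$t']

    # to get everything you would keep following the rel="next" link in
    # feed['feed']['link'] -- see the comments below about that link growing
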
  • Thanks for the response. Unfortunately, the YouTube API does not return all comments due to limitations imposed by them. – Thoth Aug 16 '14 at 15:17
  • Thanks for your interest again. When using `videoId=ZSzeFFsKEt4` the script stops after page 2. Doing this manually you can go further. Is this because of Python or because of restrictions that the [site](http://www.sandracires.com/en/client/youtube/comments.php?v=ZSzeFFsKEt4&page=1) imposes? Any suggestions? Thanks again. – Thoth Aug 16 '14 at 16:23
  • It looks like their server requires cookies. I've updated my answer to use `requests.session` instead. Now it should retrieve 34 pages. – mhawke Aug 17 '14 at 11:32
  • I'm not sure about restrictions on the number of comments retrievable via the API. By default it's 25 comments per request, and you need to follow the "next" link to visit all results. However, the "next" link returned by the API grows with each request until it becomes too long. That's probably the limiting factor? – mhawke Aug 17 '14 at 11:35
  • Hi @mhawke, thanks for the answer. Regarding your last comment, you are right: the next-page token becomes extremely big (tip: you have to use `orderby=published` in the `gdata` URL). – Thoth Aug 21 '14 at 17:53