
I am trying to understand how this website works. There is an input form where you can provide a URL; the form returns information retrieved from another site (YouTube). So:

  1. My first and most interesting question is whether anybody has any idea how this site retrieves the entire corpus of comments.

  2. Alternatively, for now I am using the following code:

    from BeautifulSoup import BeautifulSoup
    import json
    import urllib2

    # videoId and npage are set elsewhere in my script
    urlstr = 'http://www.sandracires.com/en/client/youtube/comments.php?v=' + videoId + '&page=' + str(npage)
    url = urllib2.urlopen(urlstr)
    content = url.read()
    soup = BeautifulSoup(content)

    # parse the JSON response
    newDictionary = json.loads(str(soup))

    # print an example field
    print newDictionary['list'][1]['username']
    

    However, I cannot iterate over all the pages (which is not the case when I do it manually). I have placed time.sleep(30) after the JSON parsing, but without success. Why is that happening?

Thanks!

Python 2.7.8


1 Answer

  1. Probably by using the Google YouTube Data API. Note that (presently) comments can only be retrieved using version 2 of the API, which has been deprecated; apparently there is no support yet in v3. Python client libraries are available, see https://developers.google.com/youtube/code#Python. A rough sketch of querying that feed directly appears at the end of this answer.

  2. The response is already JSON, so there is no need for BeautifulSoup. The web server seems to require cookies, so I recommend the requests module, in particular its session management:

    import requests

    videoId = 'ZSzeFFsKEt4'
    results = []
    npage = 1
    # a session persists cookies between requests, which this server requires
    session = requests.session()
    while True:
        urlstr = 'http://www.sandracires.com/en/client/youtube/comments.php'
        print "Getting page ", npage
        response = session.get(urlstr, params={'v': videoId, 'page': npage})
        content = response.json()
        # a near-empty 'list' means there are no more pages
        if len(content['list']) > 1:
            results.append(content)
        else:
            break
        npage += 1

    print results
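
As a usage note, here is a minimal post-processing sketch that flattens the collected pages into one list of usernames; it assumes the pages keep the same `list`/`username` fields shown in the question, which I have not verified beyond the first page:

    # Sketch: flatten the collected pages into one list of usernames.
    # The 'list' and 'username' field names are taken from the question.
    usernames = []
    for page in results:
        for comment in page['list']:
            usernames.append(comment.get('username'))

    print "collected %d comments" % len(usernames)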
    
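Regarding point 1, here is a rough sketch of what pulling comments directly from the deprecated v2 gdata feed might look like. Treat the URL shape, the `alt=json`/`orderby=published` parameters and the `$t` field layout as assumptions based on the old gdata conventions, not verified code:

    import json
    import urllib2

    # Sketch only: query the (deprecated) v2 gdata comments feed.
    # URL and field names follow the old gdata JSON conventions.
    videoId = 'ZSzeFFsKEt4'
    url = ('https://gdata.youtube.com/feeds/api/videos/' + videoId +
           '/comments?alt=json&orderby=published&max-results=25')
    feed = json.load(urllib2.urlopen(url))

    for entry in feed['feed']['entry']:
        print entry['content']['$t']

    # to get everything you would keep following the rel="next" link in
    # feed['feed']['link'] -- see the comments below about that link growing
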
  • Thanks for the response. Unfortunately, the YouTube API does not return all comments due to limitations imposed by them. – Thoth Aug 16 '14 at 15:17
  • Thanks for your interest again. When using `videoId=ZSzeFFsKEt4` the script stops after page 2. Doing this manually you can go further. Is this because of Python or because of restrictions that the [site](http://www.sandracires.com/en/client/youtube/comments.php?v=ZSzeFFsKEt4&page=1) imposes? Any suggestions? Thanks again. – Thoth Aug 16 '14 at 16:23
  • It looks like their server requires cookies. I've updated my answer to use `requests.session` instead. Now it should retrieve 34 pages. – mhawke Aug 17 '14 at 11:32
  • I'm not sure about restrictions on the number of comments retrievable via the API. By default it's 25 comments per request, and you need to follow the "next" link to visit all results. However, the "next" link returned by the API grows with each request until it becomes too long. That's probably the limiting factor? – mhawke Aug 17 '14 at 11:35
  • Hi @mhawke, thanks for the answer. Regarding your last comment, you are right: the next-page token becomes extremely big (tip: you have to use `orderby=published` in the `gdata` URL). – Thoth Aug 21 '14 at 17:53