
I need to crawl all of the comments (more than 2,600,000 comments, spread over more than 5,000 pages) on PSY's Gangnam Style video on YouTube; see: http://www.youtube.com/all_comments?v=9bZkp7q19f0

The problems are:

1) If I use the GData service, Google returns no more than 1,000 comments from the feed.

2) If I crawl the HTML directly from:

site(http://www.youtube.com/all_comments?v=9bZkp7q19f0&page=$(page))

and keep incrementing the page parameter, it fails after page 101: no comments are displayed on the page.
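To make the first limit concrete, here is a minimal sketch of how the paged feed requests could be generated. The feed URL pattern, the `start-index` and `max-results` parameters, and the page size of 50 are assumptions based on the GData v2 conventions and are not stated in this post; `buildCommentFeedUrls` and `FEED_CAP` are hypothetical names, with the cap set to the 1,000-entry limit described above.

```javascript
// Hypothetical helper: lists the feed URLs needed to page through comments.
// Assumptions (not from the post): the v2 feed URL pattern, a 1-based
// `start-index`, and a page size of 50. The 1,000 cap is from the post.
const FEED_CAP = 1000;

function buildCommentFeedUrls(videoId, totalComments, pageSize = 50) {
  // Nothing beyond the cap can be requested, however many comments exist.
  const reachable = Math.min(totalComments, FEED_CAP);
  const urls = [];
  for (let start = 1; start <= reachable; start += pageSize) {
    const count = Math.min(pageSize, reachable - start + 1);
    urls.push(
      'https://gdata.youtube.com/feeds/api/videos/' +
        encodeURIComponent(videoId) +
        '/comments?start-index=' + start + '&max-results=' + count
    );
  }
  return urls;
}

const urls = buildCommentFeedUrls('9bZkp7q19f0', 2600000);
console.log(urls.length); // 20
```

Under these assumptions, 20 requests exhaust everything the feed will hand out, which is a tiny fraction of the 2,600,000 comments.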

So, how can I get around this problem?

P.S.: My crawler is implemented as a Chrome extension in JavaScript; it checks the comment tags of the loaded page and then loads the next page.
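The crawl loop described above can be sketched roughly as follows. `crawlComments`, `pageUrl`, `fetchPage`, and `extractComments` are hypothetical names; the page fetcher and the tag parser are passed in as parameters so the loop makes no assumption about how the extension loads pages or which tags hold the comments. The loop stops at the first page that yields no comments, which is the situation encountered after page 101.

```javascript
// Build the page URL from the template given in the question.
function pageUrl(videoId, page) {
  return 'http://www.youtube.com/all_comments?v=' +
    encodeURIComponent(videoId) + '&page=' + page;
}

// Hypothetical crawl loop. fetchPage(page) must resolve to that page's HTML,
// and extractComments(html) must return an array of comment strings.
async function crawlComments(fetchPage, extractComments, maxPages = 5000) {
  const comments = [];
  let lastPage = 0; // last page that actually contained comments
  for (let page = 1; page <= maxPages; page++) {
    const html = await fetchPage(page);
    const found = extractComments(html);
    if (found.length === 0) break; // empty page: the site serves nothing more
    comments.push(...found);
    lastPage = page;
  }
  return { comments, lastPage };
}
```

Recording `lastPage` makes it easy to tell whether a run ended because the comments ran out or because the site stopped serving pages early.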

nickhar
  • I'm not exactly sure, but don't you have to pay for more comments? I think that's why there is a limit. – ioan Nov 03 '12 at 20:57
  • Thanks for your advice, but I'm not familiar with buying service quota from Google. Do you have any experience with that, or do you know where I can find the related docs? – Robin Hsiang Nov 06 '12 at 09:33
  • [stackoverflow - how to fetch more than 1000](http://stackoverflow.com/questions/264154/google-appengine-how-to-fetch-more-than-1000) - Does this help you? :-) – ioan Nov 06 '12 at 12:15
  • Did you try [TubeKit](http://www.tubekit.org/)? – tatsuhirosatou Nov 27 '12 at 03:05

1 Answer


You may be able to extract the data by crawling the pages and working around each problem as you encounter it, but that is not the proper way.

You should use the YouTube API for this and check the other developer resources related to it.

mtk
  • I already tried the YouTube GData API, but Google limits the returned result to no more than 1,000 entries; see the [YouTube API Ref](https://developers.google.com/youtube/2.0/reference?hl=en). Even when I click through the all_comments page manually, I still cannot get to page 102. – Robin Hsiang Nov 03 '12 at 13:47
  • Were you able to fetch even 1,000 comments? I can fetch only 99. :-/ – P.C. Sep 20 '14 at 14:20