0

I want to scrape usernames from YouTube comments, for example from this page:

http://www.youtube.com/all_comments?v=mIA0W69U2_Y

I want to get all the usernames/display names, such as "fedfields" and "mystik dread", along with the corresponding links (clicking "fedfields" leads to that user's profile). I want to scrape them using automated bash scripts. I have the following questions:

1. My original approach is to write automated scripts that use wget to download the page and then use regex to extract the names. But this way I need to download the whole page; each page is several MB, so if I download a lot of pages, it takes up too much space. Are there better ways?

2. There are many pages (the link above has 7). Is it possible to get them all in one page?

Sergiu Dumitriu
wenzi

6 Answers

2

You can use HtmlAgilityPack in your C# application.

        HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = web.Load(Url);
        IEnumerable<HtmlNode> userNames = doc.DocumentNode.Descendants("a").Where(
            d => d.Attributes.Contains("class") &&   
            d.Attributes["class"].Value.Contains("yt-user-name"));

Useful info about parsing html with RegEx

I don't know whether YouTube serves its content with gzip compression, but you can check with the WebRequest class. If it does, requesting compressed responses will reduce traffic significantly.

HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
webRequest.Method = WebRequestMethods.Http.Get;
webRequest.KeepAlive = true;
webRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
webRequest.Headers.Add("Accept-Encoding", "gzip,deflate");
HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse(); 
MessageBox.Show(webResponse.ContentEncoding.ToString());

Then you can read the response stream and extract the user names with HtmlAgilityPack.
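For readers scripting this outside C#, the same two ideas (ask for gzip, then pick out anchors by class) can be sketched in Python with only the standard library. This is a minimal sketch and not the answer's code: the names `decode_body`, `UserNameParser`, `extract_users` and `fetch` are made up for illustration, the `yt-user-name` class is taken from the snippet above, and the page markup may have changed since this was written.

```python
import gzip
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def decode_body(raw, content_encoding):
    """Decompress the body if the server says it is gzip, then decode to text."""
    if (content_encoding or "").lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode("utf-8", errors="replace")


class UserNameParser(HTMLParser):
    """Collect (name, href) pairs from <a> tags whose class contains a marker."""

    def __init__(self, marker="yt-user-name", base="http://www.youtube.com"):
        super().__init__()
        self.marker = marker  # assumed class name, taken from the C# snippet
        self.base = base
        self.users = []
        self._href = None     # href of the anchor currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and self.marker in (attrs.get("class") or ""):
            self._href = urljoin(self.base, attrs.get("href") or "")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.users.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_users(html):
    """Parse an HTML string and return the collected (name, href) pairs."""
    parser = UserNameParser()
    parser.feed(html)
    parser.close()
    return parser.users


def fetch(url):
    """Download a page in memory, asking the server for gzip to cut transfer size."""
    request = Request(url, headers={"Accept-Encoding": "gzip"})
    with urlopen(request) as response:
        return decode_body(response.read(), response.headers.get("Content-Encoding"))
```

Calling `extract_users(fetch(url))` keeps only the name/link pairs and never writes the page to disk, which speaks to the storage concern in the question.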

Community
Pavel K
2

Use ScrapeGoat on Mashape to return all usernames as a JSON object :)

https://www.mashape.com/warting/scrapegoat/

curl --include --request GET 'https://scrapegoat.p.mashape.com/?url=http%3A%2F%2Fwww.youtube.com%2Fall_comments%3Fv%3DmIA0W69U2_Y&selector=.yt-user-name' --header "X-Mashape-Authorization: <MASHAPE API KEY>"

Result:

{"message":"ok","payload":["whitehouse","Osambasucks2","Osambasucks2","Osambasucks2","omar barazanji","omar barazanji","omar barazanji","omar barazanji","omar barazanji","omar barazanji","HigherPlanes","HigherPlanes","HigherPlanes","RamonaFromPomona","RamonaFromPomona","Osambasucks2","Osambasucks2","Osambasucks2","RamonaFromPomona","terminator360tm","terminator360tm","terminator360tm","terminator360tm","terminator360tm","terminator360tm","Osambasucks2","Osambasucks2","Osambasucks2","Joe Lackey","Joe Lackey","Joe Lackey","ThaGenius101","ThaGenius101","ThaGenius101","Joe Lackey","Ed Patowski","Ed Patowski","Ed Patowski","toughdogyt","toughdogyt","toughdogyt","Osambasucks2","Osambasucks2","Osambasucks2","goodkarmaband","goodkarmaband","Martynas Valiukas","Martynas Valiukas","Martynas Valiukas","goodkarmaband","goodkarmaband","goodkarmaband","Martynas Valiukas","XRedstone688X","XRedstone688X","XRedstone688X","goodkarmaband","Trevor Jones","Trevor Jones","Trevor Jones","goodkarmaband","V V","V V","V V","V V","V V","V V","V V","V V","V V","V V","V V","V V","leeman6417","leeman6417","leeman6417","Osambasucks2","Osambasucks2","Osambasucks2","leeman6417","sosocrazy1234","sosocrazy1234","sosocrazy1234","leeman6417","liamdudeeee","liamdudeeee","liamdudeeee","sosocrazy1234","sosocrazy1234","sosocrazy1234","sosocrazy1234","leeman6417","Ed Patowski","Ed Patowski","Ed Patowski","mastershakelock","mastershakelock","mastershakelock","VGQgex","VGQgex","VGQgex","Osambasucks2","Osambasucks2","Osambasucks2","VGQgex","MindzEnt","MindzEnt","MindzEnt","William willie","William willie","William willie","William willie","William willie","William willie","bkdmd","bkdmd","bkdmd","Osambasucks2","Osambasucks2","Osambasucks2","bkdmd","Rafael Vargas","Rafael Vargas","Rafael Vargas","7even2wenty1","7even2wenty1","7even2wenty1","cashlessbread","cashlessbread","cashlessbread","base3798","base3798","base3798","Ed Patowski","Ed Patowski","Ed Patowski","base3798","john smith","john smith","john 
smith","Ed Patowski","Neftali Acosta","Neftali Acosta","Neftali Acosta","Ed Patowski","Ed Patowski","Ed Patowski","Neftali Acosta","john smith","john smith","john smith","Neftali Acosta","Canal YooCheckTheFloow","Canal YooCheckTheFloow","Canal YooCheckTheFloow","Abandonbeast","Abandonbeast","Abandonbeast","Canal YooCheckTheFloow","Ironcitytony72","Ironcitytony72","Ironcitytony72","john smith","john smith","john smith","Ironcitytony72","Andrew Apelt","Andrew Apelt","Andrew Apelt","Ironcitytony72","Osambasucks2","Osambasucks2","Osambasucks2","Andrew Apelt","Andrew Apelt","Andrew Apelt","Andrew Apelt","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Andrew Apelt","incas94","incas94","incas94","Osambasucks2","William willie","William willie","William willie","incas94","Osambasucks2","Osambasucks2","Osambasucks2","incas94","Osambasucks2","Osambasucks2","Osambasucks2","incas94","Osambasucks2","Osambasucks2","Osambasucks2","incas94","Andrew Apelt","Andrew Apelt","Osambasucks2","LawnMowerfromHell","LawnMowerfromHell","LawnMowerfromHell","Ironcitytony72","Osambasucks2","Osambasucks2","Osambasucks2","TheAndr3tzi","TheAndr3tzi","TheAndr3tzi","thumsupformyusername","thumsupformyusername","thumsupformyusername","algett","algett","algett","thumsupformyusername","thumsupformyusername","thumsupformyusername","thumsupformyusername","algett","ferkondenster","ferkondenster","ferkondenster","Christian Heinrich","Christian Heinrich","Christian Heinrich","erieejustice911","erieejustice911","erieejustice911","ferkondenster","ferkondenster","ferkondenster","Seth Farsides","Seth Farsides","Seth Farsides","ferkondenster","ferkondenster","ferkondenster","Seth Farsides","Seth Farsides","Seth Farsides","ferkondenster","Doky9889","Doky9889","Doky9889","ferkondenster","ferkondenster","ferkondenster","ferkondenster","Doky9889","sealrk19","sealrk19","sealrk19","wiljam12345","wiljam12345","wiljam12345","Dwayne Cole","Dwayne Cole","Dwayne 
Cole","Osambasucks2","Osambasucks2","Osambasucks2","Dwayne Cole","Jax Jr","Jax Jr","Jax Jr","Rafael Vargas","Rafael Vargas","Rafael Vargas","William willie","William willie","William willie","William willie","William willie","William willie","Gunnar Rowe","Gunnar Rowe","Gunnar Rowe","Rafael Vargas","Rafael Vargas","Rafael Vargas","Susan Porter","Susan Porter","Susan Porter","derp toth","derp toth","derp toth","MXNR16","nick62301","nick62301","nick62301","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","SeventhSun","SeventhSun","SeventhSun","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Rafael Vargas","Rafael Vargas","Rafael Vargas","senormierda","senormierda","senormierda","Rafael Vargas","chrisgilofficial","chrisgilofficial","chrisgilofficial","MXNR16","Osambasucks2","Osambasucks2","Osambasucks2","chrisgilofficial","chrisgilofficial","chrisgilofficial","chrisgilofficial","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","chrisgilofficial","chrisgilofficial","chrisgilofficial","chrisgilofficial","Osambasucks2","Andrew Apelt","Andrew Apelt","chrisgilofficial","Osambasucks2","Osambasucks2","Osambasucks2","chrisgilofficial","aztecadog","aztecadog","aztecadog","chrisgilofficial","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","chrisgilofficial","Osambasucks2","Osambasucks2","Osambasucks2","chrisgilofficial","ThePhase20","ThePhase20","ThePhase20","ICE778","ICE778","ICE778","Sabrina Blacks","Sabrina Blacks","Sabrina Blacks","Darwin Gutierrez","Darwin Gutierrez","Darwin 
Gutierrez","lessonsfromryan","tooncrazy1","tooncrazy1","tooncrazy1","unbreackable3000","unbreackable3000","unbreackable3000","Barack Obama","Barack Obama","Barack Obama","Osambasucks2","Osambasucks2","Osambasucks2","Barack Obama","tooncrazy1","tooncrazy1","tooncrazy1","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","tooncrazy1","Osambasucks2","Osambasucks2","Osambasucks2","tooncrazy1","Osambasucks2","Osambasucks2","Osambasucks2","Barack Obama","Americaunderduress","Americaunderduress","Americaunderduress","Barack Obama","Barack Obama","Barack Obama","Osambasucks2","Osambasucks2","Osambasucks2","Barack Obama","FoodStampBarry","FoodStampBarry","FoodStampBarry","Barack Obama","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","myviewsontheworld","myviewsontheworld","myviewsontheworld","SuperNikoYT","SuperNikoYT","SuperNikoYT","myviewsontheworld","Osambasucks2","Osambasucks2","Osambasucks2","myviewsontheworld","Americaunderduress","Americaunderduress","Americaunderduress","myviewsontheworld","Asuma741","Asuma741","Asuma741","RevolutionNewz","damonjo15","damonjo15","damonjo15","Osambasucks2","Osambasucks2","Osambasucks2","damonjo15","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2"
,"Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","tooncrazy1","tooncrazy1","tooncrazy1","Aries2012100","KH AK","KH AK","KH AK","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","kangaroo3259","kangaroo3259","kangaroo3259","Aries2012100","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","youhan younen","youhan younen","youhan younen","tooncrazy1","tooncrazy1","tooncrazy1","youhan younen","Osambasucks2","Osambasucks2","Osambasucks2","youhan younen","Stevejobsultimate2","Stevejobsultimate2","Stevejobsultimate2","Stevejobsultimate2","Stevejobsultimate2","Stevejobsultimate2","Osambasucks2","Osambasucks2","Osambasucks2","Stevejobsultimate2","Rafael Vargas","Rafael Vargas","Rafael Vargas","drewpert0515","drewpert0515","drewpert0515","dv wfwefwe","TheAlienContactee","TheAlienContactee","TheAlienContactee","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","Jordan Beckwith","Jordan Beckwith","Jordan Beckwith","Michael Carrillo","Michael Carrillo","Michael Carrillo","gotwess","gotwess","gotwess","gotwess","Michael Carrillo","Michael Carrillo","Michael Carrillo","Michael Carrillo","gotwess","Jawad Pullin","Jawad Pullin","Jawad Pullin","TreborHG93","tooncrazy1","tooncrazy1","tooncrazy1","chickeneggchickeneg1","chickeneggchickeneg1","chickeneggchickeneg1","chickeneggchickeneg1","chickeneggchickeneg1","chickeneggchickeneg1","kinggrindhard","kinggrindhard","kinggrindhard","branoaas branoaas","branoaas branoaas","branoaas branoaas","Osambasucks2","Osambasucks2","Osambasucks2","branoaas branoaas","branoaas branoaas","branoaas branoaas","branoaas 
branoaas","Theindicud","Theindicud","Theindicud","eizieizz","eizieizz","eizieizz","Osambasucks2","Osambasucks2","Osambasucks2","eizieizz","1990Zuck","1990Zuck","1990Zuck","ArcoZakus","ArcoZakus","ArcoZakus","firemedic30ca","johnny grove","johnny grove","johnny grove","joost1v","joost1v","joost1v","Osambasucks2","Osambasucks2","Osambasucks2","joost1v","5sdk1","5sdk1","5sdk1","jeff brennan","jeff brennan","jeff brennan","izizdropshotz","izizdropshotz","izizdropshotz","izizdropshotz","jeff brennan","jeff brennan","jeff brennan","jeff brennan","Bo James","aztecadog","aztecadog","aztecadog","izizdropshotz","izizdropshotz","izizdropshotz","aztecadog","izizdropshotz","izizdropshotz","izizdropshotz","aztecadog","Greg Cimera","Greg Cimera","Greg Cimera","izizdropshotz","izizdropshotz","izizdropshotz","izizdropshotz","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","izizdropshotz","izizdropshotz","izizdropshotz","izizdropshotz","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","izizdropshotz","izizdropshotz","izizdropshotz","izizdropshotz","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","izizdropshotz","izizdropshotz","izizdropshotz","izizdropshotz","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","izizdropshotz","Paul Pascalau","Paul Pascalau","Paul Pascalau","Greg Cimera","tooncrazy1","tooncrazy1","tooncrazy1","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","Greg 
Cimera","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","aztecadog","aztecadog","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","aztecadog","aztecadog","aztecadog","Osambasucks2","izizdropshotz","izizdropshotz","izizdropshotz","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","Zajac Staszek","Zajac Staszek","Zajac Staszek","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Zajac Staszek","Zajac Staszek","Zajac Staszek","Zajac Staszek","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Zajac Staszek","Zajac Staszek","Zajac Staszek","Zajac Staszek","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Zajac Staszek","Osambasucks2","Osambasucks2","Osambasucks2","Zajac Staszek","Ed Patowski","Ed Patowski","Ed Patowski","Zajac Staszek","aztecadog","aztecadog","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","gotwess","gotwess","gotwess","aztecadog","JeremyTheMoose","JeremyTheMoose","JeremyTheMoose","5sdk1","5sdk1","5sdk1","fordbronco1991","fordbronco1991","fordbronco1991","andy kerver","andy kerver","andy kerver","Omarimage","Omarimage","Omarimage","Omarimage","Omarimage","Omarimage","justin lionti","justin lionti","justin lionti","Omarimage","Butheadbros2","Butheadbros2","Butheadbros2","Omarimage","moonbeamrider1","moonbeamrider1","moonbeamrider1","justin lionti","justin lionti","justin lionti","moonbeamrider1","moonbeamrider1","moonbeamrider1","moonbeamrider1","justin lionti","fordbronco1991","fordbronco1991","fordbronco1991","pellenyberg","pellenyberg","pellenyberg","Son Goku","Son Goku","Son 
Goku","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","Butheadbros2","Butheadbros2","Butheadbros2","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","Butheadbros2","Butheadbros2","Butheadbros2","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","Butheadbros2","Butheadbros2","Butheadbros2","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","Butheadbros2","Butheadbros2","Butheadbros2","Butheadbros2","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","Butheadbros2","Butheadbros2","Butheadbros2","Butheadbros2","5ilv3rbvll","fisch kopf","fisch kopf","fisch kopf","andrew baker","andrew baker","andrew baker","FVCKDA POPO","FVCKDA POPO","FVCKDA POPO","MrChessmans","MrChessmans","MrChessmans","BryndisiDali","Brazzer man","Brazzer man","Brazzer man","Jack Thompson","ecw141685","ecw141685","ecw141685","Osambasucks2","Osambasucks2","Osambasucks2","ecw141685","lps24evelyn","lps24evelyn","lps24evelyn","erieejustice911","erieejustice911","erieejustice911","erieejustice911","erieejustice911","erieejustice911","Keepskatin","Keepskatin","Keepskatin","erieejustice911","V V","V V","V V","Keepskatin","Abrahan Peraza","Abrahan Peraza","Abrahan Peraza","lexyloveful","Zratedguns","Zratedguns","Zratedguns","MadNoys1","MadNoys1","MadNoys1","MadNoys1","Zratedguns","MadNoys1","MadNoys1","MadNoys1","MadNoys1","MadNoys1","MadNoys1","MadNoys1","MadNoys1","MadNoys1","Joseph Pal","Joseph Pal","Joseph Pal","Joseph Pal","MadNoys1","MadNoys1","MadNoys1","MadNoys1","bear cat","laurynas stirbys","laurynas stirbys","laurynas stirbys","newjerusalem newtestament","newjerusalem newtestament","newjerusalem newtestament","amerilstones","amerilstones","amerilstones","newjerusalem newtestament","Keepskatin","Keepskatin","Keepskatin","newjerusalem newtestament","amerilstones","amerilstones","amerilstones","Keepskatin","Noah Neo","Noah Neo","Noah Neo","charmander4533","charmander4533","charmander4533","Noah Neo","Noah Neo","Noah Neo","Noah Neo","charmander4533","Noah Neo","Noah Neo","Noah 
Neo","charmander4533","Osambasucks2","Osambasucks2","Osambasucks2","Noah Neo","George Washington","George Washington","George Washington","charmander4533","izizdropshotz","izizdropshotz","izizdropshotz","charmander4533","Wavanova","Wavanova","Wavanova","charmander4533","wisestfoolalive","wisestfoolalive","wisestfoolalive","Noah Neo","Noah Neo","Noah Neo","Noah Neo","wisestfoolalive","colin dooley","colin dooley","colin dooley","colin dooley","colin dooley","colin dooley","Silme037","Silme037","Silme037","colin dooley","Keepskatin","Keepskatin","Keepskatin","colin dooley","princelord55","princelord55","princelord55","Osambasucks2","Osambasucks2","Osambasucks2","princelord55","DriadonRapShow","DriadonRapShow","DriadonRapShow","eddrum100","eddrum100","eddrum100","Ryan S","Osambasucks2","Osambasucks2","Osambasucks2","eddrum100","eddrum100","eddrum100","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","eddrum100","Ryan S","Ryan S","Ryan S","eddrum100","eddrum100","eddrum100","Ryan S","Ryan S","Ryan S","Ryan S","Ryan S","eddrum100","eddrum100","eddrum100","eddrum100","RatedMForModz","RatedMForModz","RatedMForModz","alban97","alban97","alban97","RatedMForModz","Alex Bannon","Alex Bannon","Alex Bannon","alban97","alban97","alban97","alban97","Alex Bannon","james aaron","james aaron","james aaron","RatedMForModz","Ryan S","Ryan S","Ryan S","Dylan N","killllshot","killllshot","killllshot","Saadia Khan","Saadia Khan","talithatf17","talithatf17","talithatf17","amerilstones","amerilstones","amerilstones","talithatf17","BENGHAZIneverForget","BENGHAZIneverForget","BENGHAZIneverForget","talithatf17","talithatf17","talithatf17","supergrover6868","supergrover6868","supergrover6868","talithatf17","Alexander Sigsworth","Alexander Sigsworth","Alexander 
Sigsworth","supergrover6868","Zratedguns","Zratedguns","Zratedguns","supergrover6868","Keepskatin","Keepskatin","Keepskatin","Zratedguns","Butheadbros2","Butheadbros2","Butheadbros2","Zratedguns","Omegeist","Omegeist","Omegeist","supergrover6868","2Dmensions","2Dmensions","2Dmensions","talithatf17","talithatf17","talithatf17","supergrover6868","supergrover6868","supergrover6868","talithatf17","newjerusalem newtestament","newjerusalem newtestament","newjerusalem newtestament","supergrover6868","VGQgex","VGQgex","VGQgex","talithatf17","talithatf17","talithatf17","talithatf17","Mandragara","Mandragara","Mandragara","talithatf17","deathzbo","deathzbo","deathzbo","Mandragara","Mandragara","Mandragara","deathzbo","Mandragara","Mandragara","Mandragara","deathzbo","deathzbo","deathzbo","deathzbo","Mandragara","eddrum100","eddrum100","eddrum100","Mandragara","Mandragara","Mandragara","Mandragara","eddrum100","Unit01232","Unit01232","Unit01232","supergrover6868","supergrover6868","supergrover6868","Unit01232","Osambasucks2","Osambasucks2","Osambasucks2","supergrover6868","eddrum100","eddrum100","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","Unit01232","Unit01232","Unit01232","Unit01232","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Unit01232","eddrum100","eddrum100","eddrum100","senormierda","Osambasucks2","Osambasucks2","Osambasucks2","eddrum100","eddrum100","eddrum100","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","eddrum100","Kevin Koala","Kevin Koala","Kevin Koala","senormierda","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","GGRSC","GGRSC","GGRSC","GGRSC","eddrum100","michael smith","michael smith","michael 
smith","GGRSC","GGRSC","GGRSC","truthinvideos","supergrover6868","supergrover6868","supergrover6868","GGRSC","supergrover6868","supergrover6868","supergrover6868","eddrum100","eddrum100","eddrum100","supergrover6868","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","supergrover6868","eddrum100","eddrum100","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","supergrover6868","supergrover6868","supergrover6868","bobothecreepyclown","eddrum100","eddrum100","eddrum100","supergrover6868","supergrover6868","supergrover6868","supergrover6868","eddrum100","eddrum100","eddrum100","eddrum100","supergrover6868","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","willypdyer","willypdyer","willypdyer","Osambasucks2","Osambasucks2","Osambasucks2","willypdyer","spairtain","spairtain","sp
airtain","DigitalAcceptance","DigitalAcceptance","DigitalAcceptance","ElRancholo2","Osambasucks2","Osambasucks2","Osambasucks2","DigitalAcceptance","ElRancholo2","ElRancholo2","ElRancholo2","DigitalAcceptance","Osambasucks2","Osambasucks2","Osambasucks2","ElRancholo2","Mark Tse","Mark Tse","Mark Tse","DigitalAcceptance","Mark Tse","Mark Tse","Mark Tse","Mark Tse","The Best","The Best","The Best","supergrover6868","supergrover6868","supergrover6868","creativeengineer","creativeengineer","creativeengineer","eddrum100","Ed Patowski","Ed Patowski","Ed Patowski","creativeengineer","Ed Patowski","Ed Patowski","Ed Patowski","creativeengineer","creativeengineer","creativeengineer","creativeengineer","Ed Patowski","Ed Patowski","Ed Patowski","Ed Patowski","creativeengineer","eddrum100","eddrum100","eddrum100","creativeengineer","creativeengineer","creativeengineer","creativeengineer","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","creativeengineer","supergrover6868","supergrover6868","supergrover6868","creativeengineer","creativeengineer","creativeengineer","creativeengineer","supergrover6868","supergrover6868","supergrover6868","creativeengineer","comicozy87","comicozy87","comicozy87","Raven Gomez","turbidhat","turbidhat","turbidhat","Daracon1010","Daracon1010","Daracon1010","Daracon1010","turbidhat","turbidhat","turbidhat","Daracon1010","VGQgex","VGQgex","VGQgex","Daracon1010","Daracon1010","Daracon1010","Daracon1010","VGQgex","WeThePeopleNoNWO","WeThePeopleNoNWO","WeThePeopleNoNWO","amerilstones","zmanthecool","zmanthecool","zmanthecool","metal220","supergrover6868","supergrover6868","supergrover6868","1974wolfman","1974wolfman","1974wolfman","William willie","William willie","William willie","1974wolfman","1974wolfman","1974wolfman","1974wolfman","William willie","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Kanwar Judge","Kanwar Judge","Kanwar 
Judge","getsumdonginurmouth","getsumdonginurmouth","getsumdonginurmouth","abu bakr","abu bakr","abu bakr","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","eddrum100","eddrum100","eddrum100","eddrum100","Obamalies100","Obamalies100","Obamalies100","Obamalies100","eddrum100","eddrum100","eddrum100","eddrum100","Obamalies100","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","eddrum100","eddrum100","eddrum100","eddrum100","Obamalies100","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","eddrum100","eddrum100","eddrum100","eddrum100","Obamalies100","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","getsumdonginurmouth","getsumdonginurmouth","getsumdonginurmouth","getsumdonginurmouth","Obamalies100","amerilstones","amerilstones","amerilstones","getsumdonginurmouth","getsumdonginurmouth","getsumdonginurmouth","getsumdonginurmouth","amerilstones","amerilstones","amerilstones","amerilstones","getsumdonginurmouth","Obamalies100","Obamalies100","Obamalies100","getsumdonginurmouth","Obamalies100","Obamalies100","Obamalies100","getsumdonginurmouth","Obamalies100","Obamalies100","Obamalies100","eddrum100","ThaYayo","ThaYayo","ThaYayo","William willie","chrisn365","chrisn365","chrisn365","Eli Jackson","Eli Jackson","Eli Jackson","Jboulos12","Frank Adams","Frank Adams","Frank 
Adams","amerilstones","amerilstones","amerilstones","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","eddrum100","eddrum100","eddrum100","amerilstones","amerilstones","amerilstones","amerilstones","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","supergrover6868","supergrover6868","supergrover6868","amerilstones","amerilstones","amerilstones","amerilstones","supergrover6868","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","LiamborninDC","LiamborninDC","LiamborninDC","Osambasucks2","Osambasucks2","
Osambasucks2","Osambasucks2","LiamborninDC","Osambasucks2","Osambasucks2","Osambasucks2","William willie","Osambasucks2","Osambasucks2","Osambasucks2","killllshot","killllshot","killllshot","killllshot","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","killllshot","killllshot","killllshot","killllshot","Osambasucks2","supergrover6868","supergrover6868","supergrover6868","killllshot","Osambasucks2","Osambasucks2","Osambasucks2","killllshot"],"status":200}
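The payload repeats each name once per matching element, so most consumers will want to deduplicate it. A minimal Python sketch, assuming only the response shape shown above (a JSON object with a `payload` list); `unique_names` is a hypothetical helper name:

```python
import json


def unique_names(response_text):
    """Return the payload names with duplicates removed, first-seen order kept."""
    payload = json.loads(response_text)["payload"]
    # dict.fromkeys keeps the first occurrence of each key in insertion order
    return list(dict.fromkeys(payload))


sample = ('{"message":"ok","payload":["whitehouse","Osambasucks2",'
          '"Osambasucks2","omar barazanji"],"status":200}')
print(unique_names(sample))  # ['whitehouse', 'Osambasucks2', 'omar barazanji']
```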
Wärting
0

Do this:

import re
import sys
import time
import urllib2

html = True

argv_list = sys.argv
if len(argv_list) == 2:
    vid = argv_list[1]
else:
    vid = "mIA0W69U2_Y"

regex = re.compile("<span class=\"author.*?<a href=\"(.*?)\".*? dir=\"ltr\">(.*?)</a>", re.DOTALL | re.UNICODE | re.IGNORECASE)

index = 1
author_lists = []
t1 = time.time()
print "######################### Start #########################"

while 1:
    url = "http://www.youtube.com/watch_ajax?action_get_comments=1&v="+vid+"&commenttype=everything&source=w&page_size=500&p="+str(index)+"&format=XML"
    print "Retrieving page "+str(index)+": ", url
    o = urllib2.urlopen(url)
    r = o.read()
    elements = regex.findall(r)
    author_list = []
    for x, y in elements:

        if x.startswith("http://") or x.startswith("https://"):
            continue
        xx = "".join(["http://www.youtube.com", x])
        href = xx.strip()
        #print href


        if "</span>" not in y :
            uname = y.strip()
        else:
            uname = y.split("</span>")[0].strip()

        if uname.startswith("<a"):
            continue

        if not uname or not href:
            continue

        if html:
            #1 output html
            author = "".join(["<a href=\"", href, "\">", uname, "</a>"])
        else:
            #2 output txt
            author = " ".join([uname, href])

        author_list.append(author)

    t = "%02d:%02d:%02d" % reduce(lambda ll,b : divmod(ll[0],b) + ll[1:], [(time.time()-t1,),60,60])
    print "".join(["Time passed: ", t])
    if not author_list:
        break
    else:
        author_lists.extend(author_list)
    index+=1
    #break #uncomment it if you only want to test one page

print "######################### Finished #########################"
print "Total comments: ", len(author_lists)
if author_lists:
    author_lists.sort()
    last = author_lists[-1]
    for i in range(len(author_lists)-2, -1, -1):
        if last == author_lists[i]:
            del author_lists[i]
        else:
            last = author_lists[i]
    if html:
        authors = "<br>".join(author_lists)
        authors = "".join(["<html><meta http-equiv='Content-Type' content='text/html; charset=utf-8'><body>", authors, "</body></html>"])
        fname = vid+".html"
    else:
        authors = "\n".join(author_lists)
        fname = vid+".txt"

    #print "Authors: ", authors
    print "Total commenters: ", len(author_lists)



    oo = open(fname, "w")
    oo.write(authors)
    oo.close()
print "######################### Exit #########################"

[Image: example txt output]

[Image: example html output]
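The script above targets Python 2 (`urllib2`, `print` statements). Its core, the regex extraction and the page-until-empty loop, might look like this in Python 3. This is a sketch under assumptions: `parse_authors` and `collect_authors` are hypothetical names, the regex is copied from the script, and the page fetcher is passed in as a function because I can't confirm that the `watch_ajax` endpoint still responds.

```python
import re

# Same pattern as the script above, written as a raw string.
AUTHOR_RE = re.compile(
    r'<span class="author.*?<a href="(.*?)".*? dir="ltr">(.*?)</a>',
    re.DOTALL | re.IGNORECASE,
)


def parse_authors(page):
    """Return (name, profile_url) pairs found in one page of comment markup."""
    authors = []
    for href, name in AUTHOR_RE.findall(page):
        if href.startswith(("http://", "https://")):
            continue  # the original script skips absolute links
        name = name.split("</span>")[0].strip()
        if not name or name.startswith("<a"):
            continue
        authors.append((name, "http://www.youtube.com" + href.strip()))
    return authors


def collect_authors(fetch_page, max_pages=1000):
    """Call fetch_page(1), fetch_page(2), ... until a page yields no authors."""
    seen = set()  # a set removes duplicate commenters without a sort-and-delete pass
    for index in range(1, max_pages + 1):
        found = parse_authors(fetch_page(index))
        if not found:
            break
        seen.update(found)
    return sorted(seen)
```

Here `fetch_page` is any function that takes a page number and returns that page's markup as a string, for example a wrapper around `urllib.request.urlopen` that builds the URL the script uses.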

林果皞
0

C# can also help this way (although HAP and WebRequest are better):

    // Placeholder passed for the optional Navigate arguments
    object o = null;
    SHDocVw.InternetExplorer ie = new SHDocVw.InternetExplorerClass();
    WebBrowser wb = (WebBrowser)ie;
    wb.Visible = true;
    // Do anything else with the window here that you wish
    wb.Navigate("https://adwords.google.co.uk/um/Logout", ref o, ref o, ref o, ref o);
    // Wait until the page has finished loading
    while (wb.Busy) { Thread.Sleep(100); }
    HTMLDocument document = ((HTMLDocument)wb.Document);
    // Fill in the e-mail field
    IHTMLElement element = document.getElementById("Email");
    HTMLInputElementClass email = (HTMLInputElementClass)element;
    email.value = "testtestingtton@gmail.com";
    email = null;
    // Fill in the password field
    element = document.getElementById("Passwd");
    HTMLInputElementClass pass = (HTMLInputElementClass)element;
    pass.value = "pass";
    pass = null;
    // Click the sign-in button
    element = document.getElementById("signIn");
    HTMLInputElementClass subm = (HTMLInputElementClass)element;
    subm.click();
    subm = null;
Zameer Ansari
0

Write RSS feeds for the name field and any other fields you want to extract, then use automated plugins to set up the crawler. Follow the steps in "How to extract the data from multiple website".

Vijay G
0

Here is a simple solution using Ruby with the nokogiri and open-uri gems:

require 'nokogiri'
require 'open-uri'

url = "https://www.youtube.com/all_comments?v=mIA0W69U2_Y"
# URI.open is needed on Ruby 3+, where a bare open() no longer opens URLs
dom = Nokogiri::HTML(URI.open(url))
dom.xpath("//div[@class='comment-entry']").each do |comment|
  username = comment.xpath(".//a[contains(@class,'user-name')]").first
  username = username.content.strip if username
  profilelink = comment.xpath(".//a[contains(@class,'user-name')]/@href").first
  profilelink = profilelink.content.strip if profilelink
  # Check for nil first, then turn a relative link into an absolute one
  profilelink = "http://www.youtube.com" + profilelink if profilelink && profilelink.start_with?("/")
  puts "#{username} #{profilelink}" if username && profilelink
end

For more info visit How to extract data easily from multiple websites

fearless_fool
Vijay