I'm trying to scrape kickass.to and I'm having difficulty getting back a legible document.

Here's my code:

require 'nokogiri'
require 'open-uri'

url = "http://kickass.to/usearch/Mobile%20Suit%20Gundam:%20Char%27s%20Counterattack%201988category:movies/"
doc = Nokogiri::HTML(open(url))

result:

#<Nokogiri::HTML::Document:0x3ffb45c23ab4 name="document" children=[#<Nokogiri::XML::DTD:0x3ffb45c23744 name="html">, #<Nokogiri::XML::Element:0x3ffb45c26fc0 name="html" children=[#<Nokogiri::XML::Element:0x3ffb45c26db8 name="body" children=[#<Nokogiri::XML::Element:0x3ffb45c26bb0 name="p" children=[#<Nokogiri::XML::Text:0x3ffb45c269a8 "\u008B      å}ùvÛF²÷ßñSt8Ç\u009142H,Y\u0092©Åñ\u008Cíx,%\u0099\\_],\r\tÐX$Ñ\u0093y¢ï¾ÿî\u0093Ý_u ¸\u0088\"eÑ\u008E3>>\"6º««ªkëBõþ÷Ç?\u009Dÿöæ\u0084õ\u0093áàðÑ>}°\u009Bá \u0088*ý$íÕj×××Õk£F½\u009AÖn·k7Ô¦Â\\?:¨\u0092¨BOqË=|Äðo\u007FÈ\u009D%#\u007FLý«\u0083ÊQ$">, #<Nokogiri::XML::Element:0x3ffb45c268cc name="h">]>]>, #<Nokogiri::XML::Element:0x3ffb45c26480 name="html" children=[#<Nokogiri::XML::Element:0x3ffb45c26278 name="p" children=[#<Nokogiri::XML::Text:0x3ffb45c26070 "T~\u0093Ô¨§§Ìé[QÌ\u0093\u00834ñ\u0094V¥vWGgÉxÀvçÄñôã\u00815ä\u0097ÇNä\u008F?J CάÀenxBËeÃÐö\u009CÅ©\u009F°^¸ÖpOÀ¶ì³\u0088¬$±\u009CKfÙq8H>3/\u008C\u0098q^e§V\u009C}ÅUvìGÜ\u0099ÜaW¾Å~\u007Fì+ËXö\u0080/\u00825\ní0\u0089K`¡¸ü¦Â\">8¨¤1·\"§_¯=\u0083ó0\u008A@\u0094\u00981ýÝw.­8Îoí×d§\u0092\u009C?¸\u0094CÇ\u0084ö¸ÏyRa\th\u0099\u0091\u0090pÎú÷*µúI¬ÄwªN8¬Y\u0083\u0081¢µ\u009Aå\u0094.\u008DÑ£ÄIæ\u0083OnéÖZ=×Uñ§\u0092÷ôhfk4«$aêô\u0095»»\u009Cm]=Ñ·ìö{Eyç{l\u0090°'¬ù>cSüÂùcÎ5\u009F7¦q ¨¸\u00959N¾\u007FÇ×÷Þ+Êa6«løuÆn>üØ­UçÝ\u00924ÿìùJt·óaåJfqäÌñÛ\u0087Xȳ:ô\u0083bâÀ\u009D%ný\u0080Å'»¨î×äUFÈ[1ÞK8Q¼ á.\u008A·\u008BÁ×ßB\u0092\u0096¡£WVÄ.­\u0084°\u007F\t\u0086¤{ôp+澻Ƕ²·õdª\u0089ËÈ¢\u008B\u0081ôö\u0098:ý

You get the picture. It's illegible and I can't seem to figure out where particular elements are. Any ideas where to go from here?

icantbecool

2 Answers

I think you misunderstand how Nokogiri works. Nokogiri does not return the raw HTML of the requested page; it wraps each DOM element in a Nokogiri object and returns a Nokogiri document object that contains all of these elements.

It is difficult to help because it's unclear whether you want to extract all of the HTML or only specific parts of the page. Nokogiri works by using CSS-style selectors to 'query' the parsed document and extract the elements you want.

The Nokogiri docs will help here, but using their example...

doc.css('h3.r a').each do |link|
    puts link.content
end

This assumes you have a variable containing the result of a Nokogiri parse (in your case you've also used 'doc'). The code searches for all links (a tags) contained within an h3 tag with the class 'r'. It then loops through the elements that match these criteria (the .css method also returns an enumerable, as multiple elements could match) and prints each one's content to the console.

Nizmox

Works fine for me on MRI Ruby 2.1.1. You could try reinstalling or updating Nokogiri, Ruby, or both.

Patrick Oscity