How do I fix this Nokogiri document result to make it legible?

Question

I'm trying to scrape kickass.to and I'm having difficulty getting a legible document back.

Here's my code:

require 'nokogiri'
require 'open-uri'

url = "http://kickass.to/usearch/Mobile%20Suit%20Gundam:%20Char%27s%20Counterattack%201988category:movies/"
doc = Nokogiri::HTML(open(url))

result:

#<Nokogiri::HTML::Document:0x3ffb45c23ab4 name="document" children=[#<Nokogiri::XML::DTD:0x3ffb45c23744 name="html">, #<Nokogiri::XML::Element:0x3ffb45c26fc0 name="html" children=[#<Nokogiri::XML::Element:0x3ffb45c26db8 name="body" children=[#<Nokogiri::XML::Element:0x3ffb45c26bb0 name="p" children=[#<Nokogiri::XML::Text:0x3ffb45c269a8 "\u008B      å}ùvÛF²÷ßñSt8Ç\u009142H,Y\u0092©Åñ\u008Cíx,%\u0099\\_],\r\tÐX$Ñ\u0093y¢ï¾ÿî\u0093Ý_u ¸\u0088\"eÑ\u008E3>>\"6º««ªkëBõþ÷Ç?\u009Dÿöæ\u0084õ\u0093áàðÑ>}°\u009Bá \u0088*ý$íÕj×××Õk£F½\u009AÖn·k7Ô¦Â\\?:¨\u0092¨BOqË=|Äðo\u007FÈ\u009D%#\u007FLý«\u0083ÊQ$">, #<Nokogiri::XML::Element:0x3ffb45c268cc name="h">]>]>, #<Nokogiri::XML::Element:0x3ffb45c26480 name="html" children=[#<Nokogiri::XML::Element:0x3ffb45c26278 name="p" children=[#<Nokogiri::XML::Text:0x3ffb45c26070 "T~\u0093Ô¨§§Ìé[QÌ\u0093\u00834ñ\u0094V¥vWGgÉxÀvçÄñôã\u00815ä\u0097ÇNä\u008F?J CÎ¬ÀenxBËeÃÐö\u009CÅ©\u009F°^¸ÖpOÀ¶ì³\u0088¬$±\u009CKfÙq8H>3/\u008C\u0098q^e§V\u009C}ÅUvìGÜ\u0099ÜaW¾Å~\u007FÃ¬+ËXö\u0080/\u00825\ní0\u0089K`¡¸ü¦Â\">8¨¤1·\"§_¯=\u0083ó0\u008A@\u0094\u00981ýÝw.8Îoí×d§\u0092\u009C?¸\u0094CÇ\u0084ö¸ÏyRa\th\u0099\u0091\u0090pÎú÷*µúI¬ÄwªN8¬Y\u0083\u0081¢µ\u009Aå\u0094.\u008DÑ£ÄIæ\u0083OnéÖZ=×Uñ§\u0092÷ôhfk4«$aêô\u0095»»\u009Cm]=Ñ·ìö{Eyç{l\u0090°'¬ù>cSüÂùcÎ5\u009F7¦q ¨¸\u00959N¾\u007FÇ×÷Þ+Êa6«løuÆn>üØUçÝ\u00924ÿìùJt·óaåJfqäÌñÛ\u0087XÈ³:ô\u0083bâÀ\u009D%ný\u0080Å'»¨î×äUFÈ[1ÞK8Q¼ á.\u008A·\u008BÁ×ßB\u0092\u0096¡£WVÄ.\u0084°\u007F\t\u0086¤{ôp+æ¾»Ç¶²·õdª\u0089ËÈ¢\u008B\u0081ôö\u0098:ý

You get the picture. It's illegible and I can't seem to figure out where particular elements are. Any ideas where to go from here?

utf-8. Are you on windows by any chance? If so this may help: http://stackoverflow.com/questions/2572396/nokogiri-open-uri-and-unicode-characters — Patrick Oscity, May 21 '14 at 22:09
I'm on OSX. Can you post how it's returning for you? I'd appreciate it. — icantbecool, May 21 '14 at 22:10
Here you go: https://gist.githubusercontent.com/padde/97fc826b62f63eddc29f/raw/41b817dcc4c559454ee5f316ed6dd478686b00c6/gistfile1.txt — Patrick Oscity, May 21 '14 at 22:14

Nizmox · Answer 1 · 2014-05-22T00:50:57.183

I think you misunderstand how Nokogiri works. Nokogiri does not return the raw HTML of the requested page. It parses the page into a document object (the Nokogiri::HTML::Document in your output) in which each DOM element is wrapped in a Nokogiri node object.

It's difficult to help because it's unclear whether you want to extract all of the HTML or only specific parts of the page. Nokogiri works by using CSS-style selectors to 'query' the document and extract the elements you want.

The Nokogiri docs will help here. Using their example...

doc.css('h3.r a').each do |link|
    puts link.content
end

This assumes you have a variable holding the parsed document (in your case you've also called it 'doc'). The call searches for all links (a tags) contained within an h3 tag with the class 'r'. Because several elements can match the selector, .css returns an enumerable collection of nodes; the example loops over each match and prints its text content to the console.

Patrick Oscity · Accepted Answer · 2018-02-13T11:14:35.317

Works fine for me on MRI Ruby 2.1.1. Try reinstalling or updating Nokogiri, Ruby, or both.
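
If reinstalling does not help, one possible explanation (an assumption, not something confirmed in this thread) is that the server sent a gzip-compressed body that reached Nokogiri without being decompressed. The "\u008B" near the start of the output in the question is consistent with the gzip magic bytes 0x1F 0x8B. Below is a minimal sketch using only Ruby's standard library; maybe_gunzip is a hypothetical helper name, and the sample HTML stands in for a real response body.

```ruby
require 'zlib'
require 'stringio'

# The two bytes every gzip stream starts with.
GZIP_MAGIC = "\x1F\x8B".b

# Hypothetical helper (not part of Nokogiri or open-uri):
# returns the decompressed body if it looks gzip-compressed,
# otherwise returns the body unchanged.
def maybe_gunzip(body)
  bytes = body.b
  return body unless bytes.start_with?(GZIP_MAGIC)

  Zlib::GzipReader.new(StringIO.new(bytes)).read
end

# Simulate a compressed response body instead of hitting the network.
html = "<html><body><p>Hello</p></body></html>"
compressed = Zlib.gzip(html)

puts maybe_gunzip(compressed) # decompressed markup
puts maybe_gunzip(html)       # already plain, passed through unchanged
```

With the code from the question, the body would be read first and passed through the helper before parsing, for example Nokogiri::HTML(maybe_gunzip(open(url).read)). Whether this applies depends on what the server actually sends.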

edited Feb 13 '18 at 11:14

answered May 22 '14 at 04:18


the link is broken – Devstr Feb 13 '18 at 09:20
