Download HTML Text with Ruby

Question

I am trying to create a histogram of the letters (a, b, c, etc.) on a specified web page. I plan to build the histogram itself using a hash. However, I am having a bit of a problem actually getting the HTML.

My current code:

#!/usr/local/bin/ruby


require 'net/http'
require 'open-uri'


# This will be the hash used to store the
# histogram.
histogram = Hash.new(0)

def open(url)
    Net::HTTP.get(URI.parse(url))
end

page_content = open('_insert_webpage_here')

# String#each was removed in Ruby 1.9; each_line iterates line by line.
page_content.each_line do |line|
    puts line
end

This does a good job of getting the HTML. However, it gets it all. For www.stackoverflow.com it gives me:

<body><h1>Object Moved</h1>This document may be found <a HREF="http://stackoverflow.com/">here</a></body>

Pretending that it was the right page, I don't want the HTML tags. I'm just trying to get "Object Moved" and "This document may be found here".

Is there any reasonably easy way to do this?

http://stackoverflow.com/questions/2505104/html-to-plain-text-with-ruby – Flexoid, May 02 '12 at 21:32
I should have added: without Nokogiri. I am running it on my school's server, which doesn't have it installed for me to use. – Linell, May 02 '12 at 21:37

score 2 · Accepted Answer · edited May 23 '17 at 11:56

When you require 'open-uri', you don't need to redefine open with Net::HTTP.

require 'open-uri'

# On Ruby 2.5+ call URI.open; Kernel#open stopped accepting URLs in Ruby 3.0.
page_content = URI.open('http://www.stackoverflow.com').read

histogram = {}
page_content.each_char do |c|
  histogram[c] ||= 0
  histogram[c] += 1
end

Note: this does not strip out <tags> within the HTML document, so <html><body>x!</body></html> will have { '<' => 4, 'h' => 2, 't' => 2, ... } instead of { 'x' => 1, '!' => 1 }. To remove the tags, you can use something like Nokogiri (which you said was not available), or some sort of regular expression (such as the one in Dru's answer).

answered May 02 '12 at 21:50

Benjamin Manns


Thank you! I am very new to Ruby though, and I was wondering if you could explain what the `||= 0` part is doing? – Linell May 02 '12 at 22:10

Say the first character is `'h'` for hello. The `||= 0` part will check if `histogram['h']` has been set, and if not, it will initialize it to 0. It is the same as executing `histogram['h'] = histogram['h'] || 0`. Initializing `histogram = Hash.new(0)` should work, but sometimes I have issues with it. – Benjamin Manns May 02 '12 at 22:15
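To make the comparison in the comment above concrete, here is a minimal sketch that counts the same string both ways. The sample string `'hello'` and the variable names `explicit` and `defaulted` are made up for illustration; for this input, at least, the two approaches give the same hash.

```ruby
# Sample input (an assumption, not from the original thread).
text = 'hello'

# Explicit initialization, as in the answer above.
explicit = {}
text.each_char do |c|
  explicit[c] ||= 0 # assign 0 only when the key is missing (value is nil)
  explicit[c] += 1
end

# Default value: missing keys read as 0, so no initialization step is needed.
defaulted = Hash.new(0)
text.each_char { |c| defaulted[c] += 1 }

puts explicit == defaulted
```

One thing to watch with `Hash.new(0)`: reading a missing key returns 0 but does not store the key, so `defaulted.key?('z')` is still false after `defaulted['z']` is read.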

score 1 · Answer 2 · answered May 02 '12 at 21:49

See the "Following Redirection" section of the Net::HTTP documentation.
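For reference, a rough sketch of that redirect-following approach might look like the code below. The helper name `fetch` and the limit of 10 redirects are assumptions made for illustration, and this is not copied from the documentation; it only uses `Net::HTTP.get_response` and the standard response classes.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: fetch a URL, following up to `limit` redirects.
def fetch(uri_str, limit = 10)
  raise ArgumentError, 'too many HTTP redirects' if limit <= 0

  response = Net::HTTP.get_response(URI(uri_str))
  case response
  when Net::HTTPSuccess
    response
  when Net::HTTPRedirection
    # Resolve the Location header against the current URL in case it is relative.
    location = URI.join(uri_str, response['location']).to_s
    fetch(location, limit - 1)
  else
    response.value # raises an exception for any other non-2xx response
  end
end

# Example call (needs network access):
#   page_content = fetch('http://www.stackoverflow.com').body
```

With something like this, the "Object Moved" page from the question should be replaced by the body of the page it points to, assuming the redirect chain is shorter than the limit.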

codatory


score 1 · Answer 3 · answered May 02 '12 at 21:51

Stripping HTML tags without Nokogiri:

puts page_content.gsub(/<\/?[^>]*>/, "")

http://codesnippets.joyent.com/posts/show/615
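Putting the pieces together, a minimal sketch of a letter histogram over the tag-stripped text could look like this. The helper name `letter_histogram` is hypothetical, and the choice to count only ASCII letters case-insensitively is an assumption about what the question wants. The sample string is the redirect page quoted in the question.

```ruby
# Hypothetical helper: count ASCII letters in the text left after removing tags.
def letter_histogram(html)
  text = html.gsub(/<\/?[^>]*>/, '') # same regex as in the answer above
  histogram = Hash.new(0)
  text.downcase.each_char do |c|
    histogram[c] += 1 if c =~ /[a-z]/ # skip spaces, digits and punctuation
  end
  histogram
end

sample = '<body><h1>Object Moved</h1>This document may be found ' \
         '<a HREF="http://stackoverflow.com/">here</a></body>'

counts = letter_histogram(sample)
counts.sort.each { |letter, n| puts "#{letter}: #{n}" }
```

Note that a regex like this only removes the tags themselves. The contents of `<script>` and `<style>` elements, and HTML entities such as `&amp;`, would still be counted, so treat it as a rough approximation rather than a real HTML parser.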

Dru
