0

I am trying to create a histogram of the letters (a, b, c, etc.) on a specified web page. I plan to build the histogram itself with a hash. However, I am having trouble actually getting the page's text out of the HTML.

My current code:

#!/usr/local/bin/ruby


require 'net/http'
require 'open-uri'


# This will be the hash used to store the
# histogram.
histogram = Hash.new(0)

def open(url)
    Net::HTTP.get(URI.parse(url))
end

page_content = open('_insert_webpage_here')

page_content.each do |i|
    puts i
end

This does a good job of getting the HTML. However, it gets it all. For www.stackoverflow.com it gives me:

<body><h1>Object Moved</h1>This document may be found <a HREF="http://stackoverflow.com/">here</a></body>

Pretending that it was the right page, I don't want the HTML tags. I'm just trying to get "Object Moved" and "This document may be found here".

Is there any reasonably easy way to do this?

Jacob Schoen
Linell
  • http://stackoverflow.com/questions/2505104/html-to-plain-text-with-ruby – Flexoid May 02 '12 at 21:32
  • I should have added "without Nokogiri". I am running this on my school's server, which doesn't have it installed. – Linell May 02 '12 at 21:37

3 Answers

2

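When you require 'open-uri', you don't need to redefine open with Net::HTTP. open-uri fetches the page for you and follows HTTP redirects. On Ruby 3.0 and later, call URI.open, because plain open no longer accepts URLs.

require 'open-uri'

page_content = URI.open('http://www.stackoverflow.com').read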

histogram = {}
page_content.each_char do |c|
  histogram[c] ||= 0
  histogram[c] += 1
end

Note: this does not strip out <tags> within the HTML document, so <html><body>x!</body></html> will have { '<' => 4, 'h' => 2, 't' => 2, ... } instead of { 'x' => 1, '!' => 1 }. To remove the tags, you can use something like Nokogiri (which you said was not available), or some sort of regular expression (such as the one in Dru's answer).

Benjamin Manns
  • Thank you! I am very new to Ruby though, and I was wondering if you could explain what the `||= 0` part is doing? – Linell May 02 '12 at 22:10
  • 1
    Say the first character is `'h'` for hello. The `||= 0` part will check if `histogram['h']` has been set, and if not, it will initialize it to 0. It is the same as executing `histogram['h'] = histogram['h'] || 0`. Initializing `histogram = Hash.new(0)` should work, but sometimes I have issues with it. – Benjamin Manns May 02 '12 at 22:15
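To make the `||= 0` explanation concrete, here is a minimal sketch comparing it with `Hash.new(0)`. The method names `count_with_or_equals` and `count_with_default` are made up for illustration; both should produce the same counts for the same input.

```ruby
# Explicit initialization: set the count to 0 the first time a character is seen.
def count_with_or_equals(text)
  histogram = {}
  text.each_char do |c|
    histogram[c] ||= 0   # same as: histogram[c] = histogram[c] || 0
    histogram[c] += 1
  end
  histogram
end

# Default value: missing keys read as 0, so no explicit initialization is needed.
def count_with_default(text)
  histogram = Hash.new(0)
  text.each_char { |c| histogram[c] += 1 }
  histogram
end

puts count_with_or_equals('hello') == count_with_default('hello')
```

The integer default is safe here because `+=` assigns a new value to the key. The usual trouble with hash defaults comes from mutable ones such as `Hash.new([])`, where every missing key shares the same array.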
1

The "Object Moved" page is a redirect response. See the "Following Redirection" section of the Net::HTTP documentation.
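A rough sketch of that approach, adapted from the pattern in the Net::HTTP documentation, might look like the following. The names `fetch_following_redirects` and `resolve_location` are hypothetical, and the limit of 10 redirects is an arbitrary assumption.

```ruby
require 'net/http'
require 'uri'

# Resolve a Location header against the URL that produced it,
# so relative redirects work as well as absolute ones.
def resolve_location(base, location)
  URI.join(base, location).to_s
end

# Fetch a URL, following up to `limit` redirects, and return the body.
def fetch_following_redirects(url, limit = 10)
  raise ArgumentError, 'too many HTTP redirects' if limit <= 0

  response = Net::HTTP.get_response(URI(url))

  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    next_url = resolve_location(url, response['location'])
    fetch_following_redirects(next_url, limit - 1)
  else
    response.value # raises an exception for any other non-2xx response
  end
end

# Example call (needs network access):
# puts fetch_following_redirects('http://www.stackoverflow.com/')
```

The limit guards against redirect loops; without it, two pages that redirect to each other would recurse forever.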

codatory
1

You can strip HTML tags without Nokogiri by using a regular expression:

puts page_content.gsub(/<\/?[^>]*>/, "")

http://codesnippets.joyent.com/posts/show/615
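Putting the pieces together, a minimal sketch of the letter histogram could look like this. The helper names `strip_tags` and `letter_histogram` are made up for illustration, and the sample HTML is the response quoted in the question. The regular expression only removes the tags themselves, so script or style contents and entities such as `&amp;` would still be counted on a real page.

```ruby
# Remove anything that looks like an HTML tag.
def strip_tags(html)
  html.gsub(/<\/?[^>]*>/, '')
end

# Count letters only, ignoring case, digits, spaces, and punctuation.
def letter_histogram(text)
  histogram = Hash.new(0)
  text.downcase.each_char do |c|
    histogram[c] += 1 if c =~ /[a-z]/
  end
  histogram
end

html = '<body><h1>Object Moved</h1>This document may be found ' \
       '<a HREF="http://stackoverflow.com/">here</a></body>'

text = strip_tags(html)
puts text

# Print the counts in alphabetical order.
letter_histogram(text).sort.each do |letter, count|
  puts "#{letter}: #{count}"
end
```

Restricting the count to `[a-z]` is an assumption about what "letters" means in the question; drop the condition to count every remaining character.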

Dru