25

In browsers such as Firefox or Safari, with a website open, I can right-click the page and select something like "View Page Source" or "View Source." This shows the HTML source for the page.

In Ruby, is there a function (maybe a library) that allows me to store this HTML source as a variable? Something like this:

source = view_source('http://stackoverflow.com')

where source would be this text:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Stack Overflow</title>
etc
the Tin Man
Eric

8 Answers

30

Use Net::HTTP:

require 'net/http'

source = Net::HTTP.get('stackoverflow.com', '/index.html')
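The one-liner above returns the body of whatever the server sends first, which for many sites is a redirect page rather than the HTML you wanted. Below is a minimal sketch of one way to handle that with `Net::HTTP.get_response`. The helper name `fetch_source` and the redirect limit of 5 are assumptions made for illustration, not part of Net::HTTP.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of Net::HTTP): return the page body as a
# String, following at most `limit` redirects.
def fetch_source(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.zero?

  uri = URI.parse(url)
  # When given a URI, get_response enables TLS for https:// URLs itself.
  response = Net::HTTP.get_response(uri)

  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # The Location header may be relative, so resolve it against the
    # URL that was just requested.
    fetch_source(URI.join(url, response['location']).to_s, limit - 1)
  else
    # Raises an HTTP exception for other (for example 4xx/5xx) responses.
    response.value
  end
end
```

Usage would look like `source = fetch_source('http://stackoverflow.com/')`, after which `source` holds the HTML text.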
Alan W. Smith
robbrit
18
require "open-uri"
source = open(url){ |f| f.read }

Update: Ruby >= 1.9 allows this shorter syntax:

require "open-uri"
source = open(url, &:read)

Update: Ruby >= 3.0 requires this syntax:

require "open-uri"
source = URI(url).open(&:read)
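If the same code has to run on both older and newer Rubies, calling `open` on a parsed URI object is one form that should work across them, since open-uri adds that method to HTTP URIs. The sketch below wraps it in a helper; the name `read_page` and the choice to return `nil` on an HTTP error are assumptions for illustration only.

```ruby
require 'open-uri'

# Hypothetical helper: read a page via open-uri and return its body,
# or nil if the server answers with an HTTP error status.
def read_page(url)
  # The block form closes the underlying IO once the block returns.
  URI.parse(url).open(&:read)
rescue OpenURI::HTTPError => e
  # open-uri raises OpenURI::HTTPError for statuses such as 404 or 500.
  warn "Request failed: #{e.message}"
  nil
end
```

Returning `nil` keeps the caller simple, but re-raising the error may be the better choice when a missing page should stop the program.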
Nakilon
  • 2
    Even shorter: `source = open(url).read` – Mark Thomas Nov 18 '10 at 18:01
  • 2
    @Mark Thomas, it will not close the connection. – Nakilon Nov 18 '10 at 19:16
  • 2
    Both of these will close the connection? – Tom Rossi Sep 08 '13 at 20:05
  • And still the thing where it needs to be `URI.open...` (fwiw, I'm on Ruby 3.1.0). – Sixtyfive Aug 23 '22 at 10:12
  • As observed by Mark Thomas and Matt Rose, the simple `.read` is shorter. With neither `.read` nor `, &:read)` nor `{|f| f.read}` nor any invocation of `.close` was I able to observe connections being closed right away (but then, my way of testing is to simply keep watching the output of `netstat` while opening 100 connections). – Sixtyfive Aug 23 '22 at 10:27
  • In contrast, `u=URI.parse('https://domain.tld:443/');c=Net::HTTP.new(u.host,u.port);c.use_ssl=true;c.get(u.request_uri)` opens and closes in one fell swoop. So perhaps using `URI#open` should be discouraged?? – Sixtyfive Aug 23 '22 at 10:34
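For anyone who wants the connection lifetime to be explicit, as the comments above discuss, one option is the block form of `Net::HTTP.start`, which is documented to finish the session when the block returns. This is only a sketch; the helper name `fetch_and_close` is hypothetical.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: fetch one page and close the connection as soon
# as the block passed to Net::HTTP.start returns.
def fetch_and_close(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
    # The block's value (the body String) becomes the return value of start.
    http.get(uri.request_uri).body
  end
end
```

The same block can issue several `http.get` calls to reuse one connection before it is closed.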
13
require 'open-uri'
source = open(url).read  # on Ruby >= 3.0, use URI.open(url).read

Short, simple, sweet.

Matt Rose
7

Yes, like this:

require 'open-uri'

# On Ruby >= 3.0, call URI.open instead of open.
open('http://stackoverflow.com') do |file|
  # use the source, Eric
  # e.g. file.each_line { |line| puts line }
end
Skilldrick
3

You could use the built-in Net::HTTP:

>> require 'net/http'
>> Net::HTTP.get 'stackoverflow.com', '/'

Or one of the several libraries suggested in "Equivalent of cURL for Ruby?".

Community
Josh Lee
3
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://google.com/')

puts page.body

You can then do a lot of other cool stuff with Mechanize as well.

Beanish
2

Another thing you might be interested in is Nokogiri. It is a parser for HTML, XML, and similar formats that is very easy to use. Its front page has example code that should get you started and help you decide whether it's right for what you need.

Topher Fangio
  • 1
    Nokogiri has nothing to do with retrieving a page; it only parses the page once it's been retrieved by an HTTP client or read from a file. It's a very important distinction. – the Tin Man Dec 18 '15 at 19:18
  • @theTinMan - Indeed, this was more informational and perhaps should have been posted as a comment rather than an answer. My assumption was that after getting the HTML, the OP would want to do something with it :-) – Topher Fangio Dec 18 '15 at 19:21
  • 1
    We hope they'd want to do something more with it, rather than clog a network and bog down a CPU. – the Tin Man Dec 18 '15 at 19:22
2

If you have cURL installed, you could simply shell out to it:

url = 'http://stackoverflow.com'
html = `curl #{url}`
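Backticks hand the interpolated string to a shell, so a URL containing characters such as `&` or `;` can be misread as shell syntax. A possible alternative, sketched below, passes the arguments as a list through `Open3.capture2` so no shell is involved. The helper names `curl_command` and `curl_source` and the particular curl flags chosen are assumptions for illustration.

```ruby
require 'open3'

# Hypothetical helper: build the argument list for curl. Keeping the URL
# as its own array element means it is never parsed by a shell.
def curl_command(url)
  ['curl', '--silent', '--show-error', '--location', '--fail', url]
end

# Hypothetical helper: run curl and return the page body as a String.
def curl_source(url)
  stdout, status = Open3.capture2(*curl_command(url))
  raise "curl exited with status #{status.exitstatus}" unless status.success?

  stdout
end
```

Here `--location` follows redirects and `--fail` makes curl exit with a non-zero status on HTTP errors, so the helper raises instead of returning an error page.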

If you want to use pure Ruby, look at the Net::HTTP library:

require 'net/http'
stack = Net::HTTP.new 'stackoverflow.com'
# ...later...
page = '/questions/4217223/how-to-get-the-html-source-of-a-webpage-in-ruby'
html = stack.get(page).body
Nakilon
Phrogz