
Possible Duplicate:
Method to parse HTML document in Ruby?

Suppose the variable results contains:

<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><link rel="stylesheet" type="text/css" href="http://2.ai/styles/hello.css" media="screen"/><title>Welcome to Dotgeek.org * 1.ai</title></head><body>..... etc

How can I extract the title of that HTML page from the results variable, if possible without using any gem?

Community
devnull
  • Note: If this is for "general use" (i.e., it can expect any valid HTML), you should really REALLY use an HTML parser, not regular expressions or any other tricks that don't involve recreating the DOM – Earlz Sep 07 '12 at 13:45
  • Why not use a gem? Nokogiri makes short work of accurately parsing HTML and is the recommended way of doing this. Otherwise look at [REXML](http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/), which comes with Ruby. – the Tin Man Sep 07 '12 at 13:55
  • Yeah.. `Nokogiri.HTML(content).at('title').text #=> "Welcome to Dotgeek.org * 1.ai"` – Lee Jarvis Sep 07 '12 at 14:12
  • But if I can do it with match without having to depend on yet another gem (since I am only checking the title), why should I use the gem? :) – devnull Sep 07 '12 at 16:50

2 Answers

html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><link rel="stylesheet" type="text/css" href="http://2.ai/styles/hello.css" media="screen"/><title>Welcome to Dotgeek.org * 1.ai</title></head>'
html.match(/<title>(.*)<\/title>/)[1] #=> "Welcome to Dotgeek.org * 1.ai"
Danil Speransky
  • Mandatory link for parsing HTML with regexps: http://stackoverflow.com/a/1732454/908515 – undur_gongor Sep 07 '12 at 13:45
  • Sorry, I don't understand. That linked post "explains" why parsing HTML with regexps should normally be avoided. So, I posted it (although I admit that in the case given a regexp might be appropriate). Earlz's comment above is smarter but points in the same direction. – undur_gongor Sep 07 '12 at 13:51
  • Granted, I didn't say your proposal is wrong. But we don't know the concrete issue. Maybe it is mission-critical software with a public web interface. So devnull should be aware of the restrictions. – undur_gongor Sep 07 '12 at 13:55
  • Both work. The page I am parsing is a standard parking page, so I am just checking whether the title is the default one or not. Why wouldn't this be good enough, or why would it be dangerous? Doh! In 1 line I can save 2 gems (why? because gems do break and I don't want to depend on 2 gems for a simple check!!) – devnull Sep 07 '12 at 16:49
  • As a side comment, what would be an easy way to convert the HTML? It seems not to work for pages with non-standard characters... – devnull Sep 07 '12 at 20:08
  • Solved thanks to someone in IRC: https://gist.github.com/3670098, or better said, I adapted it to continue parsing if there are problems – devnull Sep 07 '12 at 22:52
  • I actually meant https://gist.github.com/3670449 – devnull Sep 08 '12 at 02:05
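
On the "non standard characters" problem raised in the comments: one possible cause (an assumption here, since the linked gists are not reproduced) is that the fetched bytes are not valid UTF-8, in which case regexp matching can fail. A minimal sketch that replaces invalid bytes before matching, using String#scrub (available since Ruby 2.1); the sample string is made up for illustration:

```ruby
# Hypothetical page content containing a byte (\xFF) that is not valid UTF-8.
raw = "<html><head><title>Caf\xFF page</title></head></html>"

# String#scrub replaces invalid byte sequences with the given replacement string.
clean = raw.scrub("?")

# String#[] with a regexp and a capture index returns nil if nothing matches.
title = clean[/<title>(.*?)<\/title>/m, 1]
puts title
```

If the page declares a different charset, converting from that encoding would be more faithful than replacing bytes; scrubbing only keeps the check from blowing up.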

You could simply split by the title tag like this:

title = results.split(/<title>/,2)[1].split(/<\/title>/,2)[0]

(edit: the second parameter to split works differently from what I am used to in Python: it doesn't count the number of splits but the maximum number of elements in the result array, meaning split(/pattern/, 1) actually doesn't split anything...)
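
To illustrate the limit behaviour described in the edit note, here is a small sketch; the strings are made-up examples, not the asker's data:

```ruby
s = "a,b,c"

# With a positive limit, split returns at most that many elements;
# the last element holds the unsplit remainder.
p s.split(",", 2)   # => ["a", "b,c"]
p s.split(",", 1)   # => ["a,b,c"]

# The same idea applied to a small made-up document:
html  = "<html><head><title>Example</title></head></html>"
title = html.split(/<title>/, 2)[1].split(/<\/title>/, 2)[0]
p title             # => "Example"
```

Note that when the opening tag is missing, the first split returns a one-element array, so `[1]` is nil and the chained `.split` raises NoMethodError.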

l4mpi
  • Hello, this seems to work too, but I have an issue with a site that seems to use another language for the title. See: curl = %x(curl http://zales.1.ai); simian = curl.match(/(.*)<\/title>/)[1]; puts simian. It throws: in `<main>': undefined method `[]' for nil:NilClass (NoMethodError) – devnull Sep 07 '12 at 20:29
  • The call to `match` returns nil because it can't find the pattern, and trying to use `[]` on nil leads to this error (see http://stackoverflow.com/questions/3835428/what-do-an-undefined-method-mean-in-rails). The page you linked to does not contain a `<title>` tag (in fact, this is its complete source: `<h1>Ahoj svete :)</h1>`), so you can't match it. – l4mpi Sep 08 '12 at 00:12
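
A minimal sketch of guarding against the nil case described above. The helper name `extract_title` is hypothetical, and this is still a regexp shortcut rather than a real HTML parser:

```ruby
# Hypothetical helper: returns the title text, or nil when no <title> element is found.
def extract_title(html)
  # String#[] with a regexp and a capture index returns nil on no match
  # instead of raising. The m flag lets . match newlines; i ignores tag case.
  html[/<title[^>]*>(.*?)<\/title>/mi, 1]
end

with_title    = "<html><head><title>Welcome to Dotgeek.org * 1.ai</title></head></html>"
without_title = "<h1>Ahoj svete :)</h1>"

p extract_title(with_title)      # => "Welcome to Dotgeek.org * 1.ai"
p extract_title(without_title)   # => nil
```

The caller can then compare the result against the expected default title and treat nil as "no title present".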