Hello everyone, I have some HTML code, shown below. I want to get the text inside each <a>(.*)</a> element.

I want to get this result:

data 1 : hello1
data 2 : hello2
data 3 : hello3

from that input:

<a>
hello1
</a>
<a>
hello2
</a>
<a>
hello3
</a>
Seki
AHmedRef
  • [Don't parse HTML with regexps](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – Seki Jul 18 '12 at 11:32
  • Use a dedicated HTML parser like [Nokogiri](http://nokogiri.org/) instead – Stefan Jul 18 '12 at 11:46

1 Answer

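To expand on the two comments, the following Nokogiri code will work for your example. You can select the nodes with either XPath or CSS selectors. A dedicated parser is much more robust than rolling your own regex, because it copes with attributes, nesting, and whitespace that a pattern like <a>(.*)</a> would trip over.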

> require 'nokogiri'
 => true 
> doc = Nokogiri::HTML("<a>hello1</a><a>hello2</a><a>hello3</a>")
 => #<Nokogiri::HTML::Document:0x3ffec2494f48 name="document" children=[#<Nokogiri::XML::DTD:0x3ffec2494bd8 name="html">, #<Nokogiri::XML::Element:0x3ffec2494458 name="html" children=[#<Nokogiri::XML::Element:0x3ffec2494250 name="body" children=[#<Nokogiri::XML::Element:0x3ffec2494048 name="a" children=[#<Nokogiri::XML::Text:0x3ffec2493e40 "hello1">]>, #<Nokogiri::XML::Element:0x3ffec249dc88 name="a" children=[#<Nokogiri::XML::Text:0x3ffec249da80 "hello2">]>, #<Nokogiri::XML::Element:0x3ffec249d878 name="a" children=[#<Nokogiri::XML::Text:0x3ffec249d670 "hello3">]>]>]>]> 
> doc.css('a').each { |node| p node.text }
"hello1"
"hello2"
"hello3"
 => 0 

Update: You'll need the nokogiri gem if you don't have it installed already.

sudo gem install nokogiri

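Depending on your setup (for example on Ruby 1.8, where RubyGems is not loaded automatically), you may also need to put this line before require 'nokogiri':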

require 'rubygems'
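
If you are not sure whether the gem is actually installed, you can check before using it. This is only a rough sketch: `library_available?` is a hypothetical helper I made up for illustration, not something RubyGems or Nokogiri provides. It relies only on the fact that `require` raises `LoadError` when a library cannot be found.

```ruby
# Hypothetical helper (not part of RubyGems or Nokogiri): returns true if
# `require name` succeeds and false if Ruby raises LoadError.
def library_available?(name)
  require name
  true
rescue LoadError
  false
end

# Print a hint on stderr instead of failing with a backtrace.
unless library_available?('nokogiri')
  warn "nokogiri is not installed; run `gem install nokogiri` first"
end
```

On Windows there is no sudo, so the install command there would be plain gem install nokogiri.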
peakxu
  • I got: LoadError: cannot load such file -- nokogiri from C:/Ruby193/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require' from C:/Ruby193/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require' from (irb):1 from C:/Ruby193/bin/irb:12:in `' – AHmedRef Jul 18 '12 at 12:06