How to get HTML class values using a regular expression in Ruby

Question

I have the string below, from which I want to extract the class values "ruby", "html", and "java". My goal here is to understand and learn regular expressions, which I have always dreaded :-).

<div class="ruby" name="ruby_doc">
<div class="html" name="html_doc">
<div class="java" name="java_doc">

This is what I have so far:

str = <<END
<div class="ruby" name="ruby_doc">
<div class="html" name="html_doc">
<div class="java" name="java_doc">
END

str.scan(/"[^"]+/)
#=> ["\"ruby", "\" name=", "\"ruby_doc", "\">\n<div class=", "\"html", ...]

str.scan(/class="[^"]+/) #=> ["class=\"ruby", "class=\"html", "class=\"java"]

str.scan(/"(\w+?)"/) #=> [["ruby"], ["ruby_doc"], ["html"], ["html_doc"], ...]

score 7 · Accepted Answer · answered Sep 15 '13 at 14:47

str.scan(/\b(?<=class=\")[^"]+(?=\")/)
# => ["ruby", "html", "java"]
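For readers new to lookarounds, here is a minimal sketch of how a pattern like the one above breaks down. It drops the leading `\b` and the optional backslashes before the quotes, and the variable name `class_values` is only illustrative, not part of the answer.

```ruby
str = <<~HTML
  <div class="ruby" name="ruby_doc">
  <div class="html" name="html_doc">
  <div class="java" name="java_doc">
HTML

# (?<=class=")  lookbehind: class=" must sit immediately before the match,
#               but it is not included in the matched text.
# [^"]+         the value itself: one or more characters that are not a quote.
# (?=")         lookahead: a closing quote must follow, also not included.
class_values = str.scan(/(?<=class=")[^"]+(?=")/)

p class_values
```

Because the pattern has no capture groups, `scan` returns a flat array of strings rather than an array of arrays.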

sawa

Arup Rakshit · Answer 2 · 2013-09-15T14:36:11.113

3

Use Nokogiri for this:

require 'nokogiri'

doc = Nokogiri::HTML::Document.parse <<-_html_
<div class="ruby" name="ruby_doc">
<div class="html" name="html_doc">
<div class="java" name="java_doc">
_html_

# to get values of class attribute
doc.xpath('//div/@class').map(&:to_s)
# => ["ruby", "html", "java"]
# to get values of name attribute
doc.xpath('//div/@name').map(&:to_s)
# => ["ruby_doc", "html_doc", "java_doc"]

edited Sep 15 '13 at 14:36

answered Sep 15 '13 at 14:30

Arup Rakshit

Since this is part of my regular expression learning, I am looking to achieve it with regex. – Bala Sep 15 '13 at 14:36
@Bala For HTML, always use an HTML parser; don't use regex. See this post: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Arup Rakshit Sep 15 '13 at 14:38

score 2 · Answer 3 · answered Sep 15 '13 at 14:42

Parsing HTML with regex is not recommended. If you must write one anyway, a somewhat acceptable regex would be:

str.scan(/<div\s+class=\s*"([^"]+)/)
#=> [["ruby"], ["html"], ["java"]]
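The nested result appears because `scan` returns one inner array per match whenever the pattern contains a capture group. As a small sketch (the names `nested` and `flat` are illustrative), `flatten` turns that into a plain list of strings:

```ruby
str = <<~HTML
  <div class="ruby" name="ruby_doc">
  <div class="html" name="html_doc">
  <div class="java" name="java_doc">
HTML

# With a capture group, each match becomes an array holding the captures.
nested = str.scan(/<div\s+class=\s*"([^"]+)/)

# flatten collapses the one-element inner arrays into a single flat array.
flat = nested.flatten

p nested
p flat
```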

bsd

score 1 · Answer 4 · answered Sep 15 '13 at 14:42

You really should use Nokogiri as per @Arup's answer. But, if you insist...

str.scan(/(?:class\=\")(\w+)(?:\")/).flatten

Live test in Ruby console

2.0.0p247 :001 > str = <<END
2.0.0p247 :002"> <div class="ruby" name="ruby_doc">
2.0.0p247 :003"> <div class="html" name="html_doc">
2.0.0p247 :004"> <div class="java" name="java_doc">
2.0.0p247 :005"> END
 => "<div class=\"ruby\" name=\"ruby_doc\">\n<div class=\"html\" name=\"html_doc\">\n<div class=\"java\" name=\"java_doc\">\n" 
2.0.0p247 :006 > str.scan(/(?:class\=\")(\w+)(?:\")/).flatten
 => ["ruby", "html", "java"]

Marcelo De Polli

Could you please explain what each regex group does? – Bala Sep 15 '13 at 14:45
The first group requires that `class="` be present, but not captured. The second group captures any sequence of word characters. The third group requires that `"` be present, but not captured. – Marcelo De Polli Sep 15 '13 at 14:46
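To make the explanation above concrete, here is a rough sketch comparing the pattern with a version that drops the non-capturing groups. Since `(?:...)` around a plain literal only groups and never captures, both should behave the same. The hyphenated input is a hypothetical example, not from the question, included to show the limit of `\w`.

```ruby
str = <<~HTML
  <div class="ruby" name="ruby_doc">
  <div class="html" name="html_doc">
  <div class="java" name="java_doc">
HTML

# Pattern from the answer: non-capturing, capturing, non-capturing.
with_groups = str.scan(/(?:class\=\")(\w+)(?:\")/).flatten

# Same match without the (?:...) wrappers; only (\w+) is captured either way.
without_groups = str.scan(/class="(\w+)"/).flatten

p with_groups
p without_groups

# \w covers only letters, digits and underscore, so a hyphenated class value
# (hypothetical input) is not matched, while [^"]+ accepts it.
hyphenated = '<div class="nav-bar">'
word_only = hyphenated.scan(/class="(\w+)"/).flatten
any_chars = hyphenated.scan(/class="([^"]+)"/).flatten

p word_only
p any_chars
```

If class values may contain hyphens or several space-separated names, `[^"]+` is the more forgiving choice for the capturing group.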

score -3 · Answer 5 · answered Sep 15 '13 at 15:39

Howsabout:

str.scan(/"(.*?)"/)
#=> [["ruby"], ["ruby_doc"], ["html"], ["html_doc"], ["java"], ["java_doc"]]

pguardiario
