1

I have this code, and I need the "href=" selector to match only integers, using a regex:

f = File.open("us.html")
doc = Nokogiri::HTML(f)

ans = doc.css('a[href=]')

puts doc

I tried doing:

ans = doc.css('a[href=\d]

or:

ans = doc.css('a[href="\d"])

but it doesn't work. Can anyone suggest a workaround?

the Tin Man
Rohan Dalvi

2 Answers

4

If you want to use a regular expression, I believe you will have to do that manually. It cannot be done with a CSS or XPath selector.

You can do it by iterating through the elements and comparing their href attribute to your regular expression. For example:

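require 'nokogiri'

html = %q{
<html>
  <a href='1'></a>
  <a href='adf'></a>
</html>
}

doc = Nokogiri::HTML(html)
# Select every <a> that has an href with CSS, then filter in Ruby with the regex.
ans = doc.css('a[href]').select { |e| e['href'] =~ /\d/ }
#=> <a href="1"></a>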
the Tin Man
Justin Ko
  • @MarkThomas You can match numbers (per the OP's desire), but [you cannot match a regex in XPath](http://stackoverflow.com/a/405507/405017) which is what the answer says. – Phrogz Sep 26 '13 at 04:47
  • +1 for correctly suggesting that sometimes it's easier and faster (for the programmer; slower for the computer) to get halfway with CSS or XPath and then finish up the problem in Ruby. – Phrogz Sep 26 '13 at 04:48
  • @Phrogz Thanks for pointing that out. In my initial read, I interpreted the "it" in "It cannot be done..." to be "The OP's problem". – Mark Thomas Sep 26 '13 at 10:08
2

You can do it in XPath:

require 'nokogiri'

html = %q{
<html>
  <a href='1'></a>
  <a href='adf'></a>
</html>
}

doc = Nokogiri::HTML(html)

puts doc.xpath('//a[@href[number(.) = .]]')
#=> <a href="1"></a>

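The XPath function number() converts its argument to a number and returns NaN when the string is not numeric. NaN never compares equal to anything, so the test number(.) = . succeeds only when the href value is numeric. It is even possible to check a range using inequality operators.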

Mark Thomas