I have a string like:

<a href="x.com">x.com</a>

and I want to wrap the link text in an <i> tag:

<a href="x.com"><i>x.com</i></a>

Using the regex >.*< I get a match for >x.com<, but I want to match only the inner text so I can gsub it:

'<a href="x.com">x.com</a>'.gsub(<what here?>,<what here?>)

How do I do this?

UPDATE

P.S. This is on Rails 3.0.3 with Ruby 1.8.7-p330.

Zabba
  • Don't ever try to "parse" HTML with regular expressions. That is not going to work. Use an actual parser like nokogiri. – Holger Just Mar 10 '11 at 00:05

5 Answers

Nokogiri is a great tool for parsing HTML and XML in Ruby. Using it frees you from dealing with all sorts of HTML inconsistencies due to malformed markup, or changing structure.

This wraps the contents of every <a> tag in an HTML document:

require 'nokogiri'

html = '<a href="x.com">x.com</a>'
doc = Nokogiri::HTML::DocumentFragment.parse(html)

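# Wrap the existing contents of every <a> element in an <i> element.
# Reading inner_html (rather than content) keeps entities and nested markup intact.
doc.search('a').each do |node|
  node.inner_html = "<i>#{node.inner_html}</i>"
end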

puts doc

# >> <a href="x.com"><i>x.com</i></a>
the Tin Man

I would very strongly recommend not editing HTML like this, but this should do what you want:

'<a href="x.com">x.com</a>'.gsub(/>(.*?)</, '><i>\1</i><')
Andrew Clark
  • How to avoid having to include `>` and `<` in the replacement text `>\1<` ? – Zabba Mar 09 '11 at 23:20
  • See sawa's answer, which utilizes lookahead and lookbehind assertions to only match the target text. Alternatively you could also put the `>` and `<` inside of parentheses and use `\1\2\3` as the replacement text, but that seems like overkill. – Andrew Clark Mar 09 '11 at 23:23
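
The `\1\2\3` alternative mentioned in the comment above can be sketched as follows. This is only an illustration: the helper name `wrap_with_groups` is hypothetical, and `[^<]+` is used in place of `.*?` as an assumption so that the text group never runs across a tag boundary. It uses only capture groups, so it should not depend on Ruby 1.9 regex features.

```ruby
# Hypothetical helper, not from the thread.
# Group 1 is the closing ">" of the opening tag, group 2 is the text,
# group 3 is the "<" that starts the closing tag.
def wrap_with_groups(html)
  html.gsub(/(>)([^<]+)(<)/, '\1<i>\2</i>\3')
end

puts wrap_with_groups('<a href="x.com">x.com</a>')
# prints: <a href="x.com"><i>x.com</i></a>
```

The single-quoted replacement string matters here: inside double quotes, `\1` would need to be written as `\\1`.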

Use a lookbehind, (?<=pattern), to require the preceding context and a lookahead, (?=pattern), to require the following context. Neither is included in the match, so only the text itself is replaced.

'<a href="x.com">x.com</a>'.gsub(/(?<=\>).*?(?=\<)/, '<i>\0</i>')
sawa
  • I get an error: `undefined (?...) sequence: /(?<=\>).*?(?=\<)/`. (I'm using Ruby 1.8.7, if that matters?) – Zabba Mar 09 '11 at 23:19
  • Yes. It's only available in Ruby 1.9. Sorry about that. But you can make it work on Ruby 1.8.7 if you install oniguruma. (Please don't ask me how. Someone else may be able to help you with that.) – sawa Mar 09 '11 at 23:30
  • It appears that Ruby does not support lookbehind. Reference: http://www.regular-expressions.info/lookaround.html – Andrew Clark Mar 09 '11 at 23:31
  • @Andrew Ruby1.9 uses oniguruma as the regexp engine, and it does have lookbehind. – sawa Mar 09 '11 at 23:35
  • Installing oniguruma under Ruby 1.8 seems easier than I thought. Try this link: [link](http://oniguruma.rubyforge.org/) – sawa Mar 09 '11 at 23:56
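
For readers on Ruby 1.9 or later, where the comments above say lookbehind is available, the lookaround approach might look like the sketch below. The helper name `italicize_tag_text` is hypothetical, and `[^<]+` replaces `.*?` as an assumption so that empty matches between adjacent tags are skipped.

```ruby
# Hypothetical helper, not from the thread. Requires Ruby 1.9+ for lookbehind.
# (?<=>) asserts a ">" just before the match; (?=<) asserts a "<" just after.
# Only the text between them is matched, so the block receives the bare text.
def italicize_tag_text(html)
  html.gsub(/(?<=>)[^<]+(?=<)/) { |text| "<i>#{text}</i>" }
end

puts italicize_tag_text('<a href="x.com">x.com</a>')
# prints: <a href="x.com"><i>x.com</i></a>
```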

How about adding parentheses, using >(.*)< instead of >.*< ?
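
One thing to watch with `>(.*)<` is that `.*` is greedy. A small sketch of the difference from the lazy `.*?` used in another answer, assuming an input with two links (the constant and helper names are hypothetical):

```ruby
# Hypothetical sample input with two links on one line.
TWO_LINKS = '<a href="x.com">x.com</a> <a href="y.com">y.com</a>'

# Greedy: .* runs to the last "<" in the string.
def greedy_capture(html)
  html[/>(.*)</, 1]
end

# Lazy: .*? stops at the first "<" after the ">".
def lazy_capture(html)
  html[/>(.*?)</, 1]
end

puts greedy_capture(TWO_LINKS)  # prints: x.com</a> <a href="y.com">y.com
puts lazy_capture(TWO_LINKS)    # prints: x.com
```

With a single link, as in the question, both forms capture the same text.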

billaraw

I don’t know Ruby very well, but maybe it has an HTML parsing library that can do this more reliably than regexes?

Obligatory link re parsing HTML with regexes

Paul D. Waite
  • I only need this for a very simple case and would rather not go too far with HTML parsing n all .. thanks for that link, though. – Zabba Mar 09 '11 at 23:27
  • @Zabba: sure, for simple stuff like this it’s probably easier busting out a regex. – Paul D. Waite Mar 10 '11 at 02:05
  • "it’s probably easier busting out a regex". Until the HTML changes. Or you need to process HTML from two different sources. – the Tin Man Mar 10 '11 at 05:36
  • @the Tin Man: well, maybe. But surely all approaches need reworking when the HTML changes? I’m not quite clear what difference it makes if you’re processing HTML from two different sources either? – Paul D. Waite Mar 10 '11 at 14:06
  • A parser isolates you from HTML changes in many ways that Regex can't, unless you have an extremely complicated pattern. I've written a lot of regex and a lot of HTML scrapers and spiders, into the hundreds, and learned some hard lessons. HTML isn't regex friendly so for anything beyond a trivial use a parser simplifies the code and maintenance. – the Tin Man Mar 10 '11 at 16:18
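
To make the point about HTML changes concrete, here is a small sketch that applies the `>(.*?)<` substitution from the answer above to two inputs that differ slightly from the original one. The helper name `naive_wrap` and both sample inputs are hypothetical, chosen only to illustrate the failure modes.

```ruby
# Hypothetical helper wrapping the regex substitution quoted in the thread.
def naive_wrap(html)
  html.gsub(/>(.*?)</, '><i>\1</i><')
end

# Case 1: a ">" inside an attribute value is mistaken for the end of the tag.
puts naive_wrap('<a href="x.com" title="a > b">x.com</a>')
# prints: <a href="x.com" title="a ><i> b">x.com</i></a>

# Case 2: whitespace between two links is also "text between > and <".
puts naive_wrap('<a href="x.com">x.com</a> <a href="y.com">y.com</a>')
# prints: <a href="x.com"><i>x.com</i></a><i> </i><a href="y.com"><i>y.com</i></a>
```

Neither input is exotic, which is roughly the argument for a parser once the markup is not fully under your control.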