
I want to remove everything contained within two HTML tags, as well as the tags themselves, using regular expressions in Ruby. Here's an example:

<tag>a bunch of stuff between the tags, no matter what it is</tag>

Basically, I want to use gsub! to filter all instances of this type out, like so:

text_file_contents.gsub!(/appropriate regex/, '')

What would be a good Ruby regular expression for doing so?

lucperkins

1 Answer


As has been said in the comments, use an HTML parser. If, however, you just want to remove everything between two tags and don't care about nesting (e.g. <tag><tag></tag></tag>), then you can simply use a non-greedy match. The m flag is needed so that . also matches newlines inside the tag:

text_file_contents.gsub!(/<tag>.*?<\/tag>/m, '')
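
To make the nesting caveat concrete, here is a minimal sketch using only the standard library. The helper name strip_tag and the sample strings are hypothetical, made up for illustration; the regex is the same non-greedy pattern, with the m flag so that content spanning several lines is matched too.

```ruby
# Hypothetical helper: removes <name>...</name> spans from a string.
# It does not understand nesting or attributes; it is only a sketch.
def strip_tag(text, name)
  tag = Regexp.escape(name)
  # .*? is non-greedy; the m flag lets . match newlines as well.
  text.gsub(/<#{tag}>.*?<\/#{tag}>/m, '')
end

# Content that spans several lines is removed.
puts strip_tag("before<tag>line one\nline two</tag>after", "tag")

# Nested tags are not handled: the match stops at the first closing tag,
# leaving "c</tag>" behind.
puts strip_tag("<tag>a<tag>b</tag>c</tag>", "tag")
```

Note that gsub! (unlike gsub) returns nil when nothing was replaced, so don't chain further calls onto its return value.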

But again, this is flaky. Nokogiri is really easy to use and will be a lot more robust, so please use that instead:

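require 'nokogiri'

# Parse the input (a String or IO); for HTML documents, Nokogiri::HTML is the usual choice.
doc = Nokogiri::XML(yourfile)
# Find every <tag> element anywhere in the document and detach it, along with its contents.
doc.search('//tag').each do |node|
  node.remove
end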
Mike H-R