How can I remove HTML using Ruby regular expressions?

Question

I want to remove everything contained within two HTML tags, as well as the tags themselves, using regular expressions in Ruby. Here's an example:

<tag>a bunch of stuff between the tags, no matter what it is</tag>

Basically, I want to use gsub! to filter all instances of this type out, like so:

text_file_contents.gsub!(/appropriate regex/, '')

What would be a good Ruby regular expression for doing so?

Use [Nokogiri](http://nokogiri.org/) when you have HTML/XML in your dish to digest it safely. – Arup Rakshit, Jun 17 '14 at 10:53
[You can't parse HTML with RegExp!](http://stackoverflow.com/a/1732454/838733) – nietonfir, Jun 17 '14 at 10:58

Mike H-R · Accepted Answer · 2020-02-16T13:46:22.747

7

As the comments say, use an HTML parser. If, however, you just want to remove everything between two tags and don't care about nesting (e.g. <tag><tag></tag></tag>), you can use the following. The m flag lets . match newlines, so content spanning several lines is removed too:

text_file_contents.gsub!(/<tag>.*?<\/tag>/m, '')
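To see what the non-greedy pattern does and where it breaks down, here is a minimal sketch. The sample strings are made up for illustration, and the m flag is included so that . also matches newlines. Treat it as a demonstration of the limitation, not a general HTML solution.

```ruby
# Non-greedy .*? stops at the first closing tag; /m lets . match newlines.
pattern = /<tag>.*?<\/tag>/m

# Illustrative input with one multi-line tag and one single-line tag.
text = "keep <tag>drop\nthis</tag> and <tag>that</tag> too"
cleaned = text.gsub(pattern, '')
p cleaned    # => "keep  and  too"

# With nested tags the match ends at the first </tag>,
# so the tail of the outer element is left behind.
nested = "<tag>outer <tag>inner</tag> tail</tag>"
leftover = nested.gsub(pattern, '')
p leftover   # => " tail</tag>"

# gsub! returns nil when nothing was replaced, unlike gsub.
no_match = "nothing to strip".dup.gsub!(pattern, '')
p no_match   # => nil
```

Because gsub! returns nil when nothing matched, gsub is the safer choice if you use the return value rather than the mutated string.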

But again, this is fragile. Nokogiri is easy to use and far more robust, so please use that instead.

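require 'nokogiri'

# yourfile holds the XML source (a String or an open IO)
doc = Nokogiri::XML(yourfile)
# Remove every <tag> element, including its contents
doc.search('//tag').each do |node|
  node.remove
end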

edited Feb 16 '20 at 13:46

answered Jun 17 '14 at 11:23


Forward slashes must be escaped, right? text_file_contents.gsub!(/<tag>.*?<\/tag>/, '') – praaveen V R Feb 14 '20 at 11:47
You're right. Thanks @praaveen. I've edited to fix that line. (The main point is not to use a regex to parse HTML, though!) – Mike H-R Feb 16 '20 at 13:47
