search and replace with ruby regex

Question

I have a text blob field in a MySQL column that contains HTML. I have to change some of the markup, so I figured I'd do it in a Ruby script. The language isn't essential here, but it would be nice to see an answer that uses Ruby. The markup looks like the following:

<h5>foo</h5>
  <table>
    <tbody>
    </tbody>
  </table>

<h5>bar</h5>
  <table>
    <tbody>
    </tbody>
  </table>

<h5>meow</h5>
  <table>
    <tbody>
    </tbody>
  </table>

I need to change just the first <h5>foo</h5> block of each text to <h2>something_else</h2> while leaving the rest of the string alone.

I can't seem to come up with the right PCRE-style regex in Ruby.

I implore you to consider using an HTML parser instead of using regex for html. As it has been said [many](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags), [many](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not), [many](http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la?lq=1) times before, Regex parsers are incapable of accurately parsing HTML. — Travis Kaufman, Apr 18 '13 at 21:16
Specifically, I recommend using [Nokogiri](http://nokogiri.org) to load your HTML, manipulate it, and then emit the result. — Phrogz, Sep 26 '14 at 19:32

Phrogz · Accepted Answer · 2011-01-16T01:59:22.677

# The regex literal syntax using %r{...} allows / in your regex without escaping
new_str = my_str.sub( %r{<h5>[^<]+</h5>}, '<h2>something_else</h2>' )

Using String#sub instead of String#gsub causes only the first replacement to occur. If you need to dynamically choose what 'foo' is, you can use string interpolation in regex literals:

new_str = my_str.sub( %r{<h5>#{searchstr}</h5>}, "<h2>#{replacestr}</h2>" )
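One caveat with interpolating into a regex literal: the interpolated text is treated as regex syntax. The sketch below assumes the search string may contain metacharacters and uses Regexp.escape to make them literal. The sample values of my_str, searchstr and replacestr are hypothetical, chosen only for illustration.

```ruby
# Hypothetical values for illustration; they do not come from the original answer.
my_str     = "<h5>a+b (draft)</h5>\n<h5>a+b (draft)</h5>"
searchstr  = "a+b (draft)"
replacestr = "something_else"

# Unescaped, "+" and "(...)" are read as regex syntax, so the heading is not matched.
unescaped = my_str.sub(%r{<h5>#{searchstr}</h5>}, "<h2>#{replacestr}</h2>")

# Regexp.escape turns the metacharacters into literals before interpolation.
escaped = my_str.sub(%r{<h5>#{Regexp.escape(searchstr)}</h5>}, "<h2>#{replacestr}</h2>")

puts unescaped == my_str
puts escaped
```

If searchstr is meant to be a regex fragment rather than literal text, the unescaped form is the right one.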

Then again, if you know what 'foo' is, you don't need a regex:

new_str = my_str.sub( "<h5>searchstr</h5>", "<h2>#{replacestr}</h2>" )

or even:

my_str[ "<h5>searchstr</h5>" ] = "<h2>#{replacestr}</h2>"
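The two non-regex forms above behave differently in ways that may matter in a script. This is a minimal sketch with a made-up two-heading string: sub returns a new string, while the []= form changes the string in place and raises IndexError when the search text is absent.

```ruby
# Sample input is hypothetical, shortened from the question's markup.
original = "<h5>foo</h5>\n<h5>bar</h5>"

# String#sub returns a new string and leaves the receiver untouched.
copy = original.sub("<h5>foo</h5>", "<h2>something_else</h2>")

# String#[]= modifies the receiver itself (first occurrence only).
mutated = original.dup
mutated["<h5>foo</h5>"] = "<h2>something_else</h2>"

# With no match, sub returns an unchanged copy, while []= raises IndexError.
unchanged = original.sub("<h5>nope</h5>", "<h2>x</h2>")
begin
  original.dup["<h5>nope</h5>"] = "<h2>x</h2>"
  raised = false
rescue IndexError
  raised = true
end
```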

If you need to run code to figure out the replacement, you can use the block form of sub:

new_str = my_str.sub %r{<h5>([^<]+)</h5>} do |full_match|
  # The expression returned from this block will be used as the replacement string
  # $1 will be the matched content between the h5 tags.
  "<h2>#{replacestr}</h2>"
end
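As an illustration of the block form actually using the capture, here is a small sketch that upcases the heading text. The upcasing is an arbitrary choice for demonstration and the input string is hypothetical.

```ruby
# Hypothetical input, shortened from the question's markup.
my_str = "<h5>foo</h5>\n<h5>bar</h5>"

new_str = my_str.sub(%r{<h5>([^<]+)</h5>}) do
  # $1 holds the text captured between the h5 tags ("foo" here).
  "<h2>#{$1.upcase}</h2>"
end

puts new_str
```

Only the first heading is rewritten; the second h5 is left alone because sub stops after one match.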

score 6 · Answer 2 · answered Jan 16 '11 at 02:12

Whenever I have to parse or modify HTML or XML I reach for a parser. I almost never bother with regex or substring searches unless the task is an absolute no-brainer.

Here's how to do it using Nokogiri, without any regex:

text = <<EOT
<h5>foo</h5>
  <table>
    <tbody>
    </tbody>
  </table>

<h5>bar</h5>
  <table>
    <tbody>
    </tbody>
  </table>

<h5>meow</h5>
  <table>
    <tbody>
    </tbody>
  </table>
EOT

require 'nokogiri'

fragment = Nokogiri::HTML::DocumentFragment.parse(text)
print fragment.to_html

fragment.css('h5').select{ |n| n.text == 'foo' }.each do |n|
  n.name = 'h2'
  n.content = 'something_else'
end

print fragment.to_html

After parsing, this is what Nokogiri has returned from the fragment:

# >> <h5>foo</h5>
# >>   <table><tbody></tbody></table><h5>bar</h5>
# >>   <table><tbody></tbody></table><h5>meow</h5>
# >>   <table><tbody></tbody></table>

And this is the output after the matching h5 has been renamed and its content replaced:

# >> <h2>something_else</h2>
# >>   <table><tbody></tbody></table><h5>bar</h5>
# >>   <table><tbody></tbody></table><h5>meow</h5>
# >>   <table><tbody></tbody></table>

score 2 · Answer 3 · edited Aug 12 '14 at 01:34

Use String#sub with the regular expression <h5>[^<]+<\/h5>. Unlike String#gsub, it replaces only the first match, which is what the question asks for:

>> current = "<h5>foo</h5>\n  <table>\n    <tbody>\n    </tbody>\n  </table>"
>> updated = current.sub(/<h5>[^<]+<\/h5>/){"<h2>something_else</h2>"}
=> "<h2>something_else</h2>\n  <table>\n    <tbody>\n    </tbody>\n  </table>"

Note: you can test Ruby regular expressions comfortably in your browser.
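The single-heading sample above cannot show how many matches get replaced. A quick sketch on a three-heading string (condensed from the question's markup, tables omitted) makes the difference between sub and gsub visible:

```ruby
# Condensed version of the question's markup; the tables are left out for brevity.
html    = "<h5>foo</h5>\n<h5>bar</h5>\n<h5>meow</h5>"
pattern = %r{<h5>[^<]+</h5>}

# sub stops after the first match.
first_only = html.sub(pattern, "<h2>something_else</h2>")

# gsub keeps going and rewrites every heading.
every = html.gsub(pattern, "<h2>something_else</h2>")

puts first_only
puts every
```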

answered Jan 16 '11 at 01:54 by miku · edited by Ross Attrill
