8

I have a text blob field in a MySQL column that contains HTML. I need to change some of the markup, so I figured I would do it in a Ruby script. Ruby itself is not essential here, but an answer that uses it would be nice. The markup looks like this:

<h5>foo</h5>
  <table>
    <tbody>
    </tbody>
  </table>

<h5>bar</h5>
  <table>
    <tbody>
    </tbody>
  </table>

<h5>meow</h5>
  <table>
    <tbody>
    </tbody>
  </table>

I need to change just the first <h5>foo</h5> block of each text to <h2>something_else</h2> while leaving the rest of the string alone.

I can't seem to come up with the right PCRE regex for this in Ruby.

randombits
  • I implore you to consider using an HTML parser instead of using regex for html. As it has been said [many](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags), [many](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not), [many](http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la?lq=1) times before, Regex parsers are incapable of accurately parsing HTML. – Travis Kaufman Apr 18 '13 at 21:16
  • Specifically, I recommend using [Nokogiri](http://nokogiri.org) to load your HTML, manipulate it, and then emit the result. – Phrogz Sep 26 '14 at 19:32

3 Answers

31
# The regex literal syntax using %r{...} allows / in your regex without escaping
new_str = my_str.sub( %r{<h5>[^<]+</h5>}, '<h2>something_else</h2>' )

Using String#sub instead of String#gsub causes only the first replacement to occur. If you need to dynamically choose what 'foo' is, you can use string interpolation in regex literals:

new_str = my_str.sub( %r{<h5>#{searchstr}</h5>}, "<h2>#{replacestr}</h2>" )
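One caveat with interpolation: if searchstr holds literal text rather than a pattern, any regex metacharacters in it (such as . or parentheses) are interpreted as regex syntax. A minimal sketch of one way to guard against that, assuming a hypothetical helper name replace_first_heading and sample values that are not from the question:

```ruby
# Hypothetical helper (not part of the original answer): replace the first
# <h5> whose text is exactly `searchstr`, treating `searchstr` as literal text.
def replace_first_heading(html, searchstr, replacestr)
  # Regexp.escape neutralises metacharacters such as "." or "(".
  pattern = %r{<h5>#{Regexp.escape(searchstr)}</h5>}
  # The block form keeps backslash sequences in `replacestr` from being
  # interpreted as backreferences.
  html.sub(pattern) { "<h2>#{replacestr}</h2>" }
end

# Illustrative input: without escaping, "foo.bar" would also match "fooXbar".
html = "<h5>fooXbar</h5>\n<h5>foo.bar</h5>\n"
puts replace_first_heading(html, "foo.bar", "something_else")
```

With the escape in place, only the heading whose text is literally foo.bar is rewritten; the fooXbar heading is left alone.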

Then again, if you know what 'foo' is, you don't need a regex:

new_str = my_str.sub( "<h5>#{searchstr}</h5>", "<h2>#{replacestr}</h2>" )

or even:

my_str[ "<h5>#{searchstr}</h5>" ] = "<h2>#{replacestr}</h2>"
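The two non-regex forms behave differently when it comes to mutation and missing matches: sub returns a new string, while the []= form edits the receiver in place and raises IndexError if the substring is not found. A small sketch of the difference, using hypothetical helper names and sample data chosen only for illustration:

```ruby
# Hypothetical helper names, for illustration only.

# Returns a new string; `str` itself is left untouched.
def replaced_copy(str, search, replacement)
  str.sub(search, replacement)
end

# Modifies `str` in place (first occurrence only).
# Raises IndexError when `search` does not occur in `str`.
def replace_in_place!(str, search, replacement)
  str[search] = replacement
  str
end

html = +"<h5>foo</h5>\n<h5>foo</h5>\n" # unary + yields an unfrozen string
copy = replaced_copy(html, "<h5>foo</h5>", "<h2>something_else</h2>")
replace_in_place!(html, "<h5>foo</h5>", "<h2>something_else</h2>")

# Both approaches touch only the first occurrence, so the results agree.
puts copy == html
```

If the text might not contain the heading at all, the sub form is the more forgiving choice, since it simply returns an unchanged copy.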

If you need to run code to figure out the replacement, you can use the block form of sub:

new_str = my_str.sub %r{<h5>([^<]+)</h5>} do |full_match|
  # The expression returned from this block will be used as the replacement string
  # $1 will be the matched content between the h5 tags.
  "<h2>#{replacestr}</h2>"
end
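As a sketch of what "running code to figure out the replacement" might look like, the block below looks up the captured heading text in a table. The RENAMES hash and the method name rename_first_heading are assumptions made for this example, not part of the original answer:

```ruby
# Hypothetical mapping from old heading text to new heading text.
RENAMES = { "foo" => "something_else" }.freeze

def rename_first_heading(html)
  html.sub(%r{<h5>([^<]+)</h5>}) do
    old_text = $1 # text captured between the h5 tags
    # Fall back to the original text when no rename is defined.
    new_text = RENAMES.fetch(old_text, old_text)
    "<h2>#{new_text}</h2>"
  end
end

puts rename_first_heading("<h5>foo</h5>\n<h5>bar</h5>\n")
```

Because sub stops after the first match, the block runs at most once and later h5 headings are left as they are.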
Phrogz
6

Whenever I have to parse or modify HTML or XML, I reach for a parser. I almost never bother with regex or substring matching unless the change is an absolute no-brainer.

Here's how to do it using Nokogiri, without any regex:

text = <<EOT
<h5>foo</h5>
  <table>
    <tbody>
    </tbody>
  </table>

<h5>bar</h5>
  <table>
    <tbody>
    </tbody>
  </table>

<h5>meow</h5>
  <table>
    <tbody>
    </tbody>
  </table>
EOT

require 'nokogiri'

fragment = Nokogiri::HTML::DocumentFragment.parse(text)
print fragment.to_html

fragment.css('h5').select{ |n| n.text == 'foo' }.each do |n|
  n.name = 'h2'
  n.content = 'something_else'
end

print fragment.to_html

This is the fragment as Nokogiri emits it right after parsing, before any changes:

# >> <h5>foo</h5>
# >>   <table><tbody></tbody></table><h5>bar</h5>
# >>   <table><tbody></tbody></table><h5>meow</h5>
# >>   <table><tbody></tbody></table>

And this is the output after the matching h5 node has been renamed and its content replaced:

# >> <h2>something_else</h2>
# >>   <table><tbody></tbody></table><h5>bar</h5>
# >>   <table><tbody></tbody></table><h5>meow</h5>
# >>   <table><tbody></tbody></table>
the Tin Man
2

Use String#sub with the regular expression <h5>[^<]+<\/h5>. Unlike String#gsub, it replaces only the first match, which is what the question asks for:

>> current = "<h5>foo</h5>\n  <table>\n    <tbody>\n    </tbody>\n  </table>"
>> updated = current.sub(/<h5>[^<]+<\/h5>/){"<h2>something_else</h2>"}
=> "<h2>something_else</h2>\n  <table>\n    <tbody>\n    </tbody>\n  </table>"
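The sample string above contains a single h5, so it does not show how sub and gsub differ on text with several headings, as in the question. A short sketch of the difference, where the constant and method names are assumptions for this example:

```ruby
# Same pattern as above, written with %r{} so "/" needs no escaping.
H5_PATTERN = %r{<h5>[^<]+</h5>}

# Replaces only the first h5 heading.
def replace_first(html)
  html.sub(H5_PATTERN, "<h2>something_else</h2>")
end

# Replaces every h5 heading.
def replace_all(html)
  html.gsub(H5_PATTERN, "<h2>something_else</h2>")
end

sample = "<h5>foo</h5>\n<h5>bar</h5>\n<h5>meow</h5>\n"
puts replace_first(sample).scan("<h2>").size # number of headings changed by sub
puts replace_all(sample).scan("<h2>").size   # number of headings changed by gsub
```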

Note that you can test Ruby regular expressions comfortably in your browser.

Ross Attrill
miku