Don't use regex or string parsing. Those will only make your head hurt. Use a parser.
In Ruby I'd use Nokogiri:
require 'nokogiri'
html = '
<html>
<body>
<nav>...</nav>
<section>...</section>
</body>
</html>
'
doc = Nokogiri::HTML(html)
# Replace the text inside the first <nav> element
doc.at('nav').content = "this is a new block"
puts doc.to_html
Which outputs:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<nav>this is a new block</nav><section>...</section>
</body></html>
Of course, you'd want to replace "this is a new block" with something like `File.read('snippet.html')`.
If your file of substitutions contains complete HTML snippets, including the nav tag itself, instead of just the nav content, use this instead:
nav = doc.at('nav').replace('<nav>this is a new block</nav>')
The output would be the same. (And, again, use `File.read` to grab that from a file if that's how you lean.)
In Nokogiri, `at` finds the first instance of the tag specified by a CSS or XPath accessor and returns the Node. I used CSS above, but `//nav` would have worked also. `at` guesses at the type of accessor. You can use `at_css` or `at_xpath` if you want to be specific, because it's possible to have ambiguous accessors. Also, Nokogiri has `search`, which returns a NodeSet, which acts like an array. You can iterate over the results doing what you want. And, like `at`, there are CSS and XPath specific versions, `css` and `xpath` respectively.
Nokogiri has a command-line interface, which would work for something as simple as this example, though I could also do it in sed or a Ruby/Perl/Python one-liner.
curl -s http://nokogiri.org | nokogiri -e'p $_.css("h1").length'
HTML is seldom this simple though, especially anything that is found roaming the wilds, and a CLI or one-liner solution will rapidly grow out of control, or simply die. I say that based on years of writing many spiders and RSS aggregators -- what starts out simple grows a lot more complex when you introduce an additional HTML or XML source, and it never gets easier. Using parsers taught me to go to them first.