I've been tasked with programmatically splitting up an HTML master template into subtemplates, performing some minor modifications to each template, and saving the resultant files.
Why I'm using Bash (you can skip this to get to the question)
(edited) The choice to use Bash is fairly arbitrary. I know the server runs Red Hat 5.5, so I'm writing a Bash script. The server does have a PHP interpreter, but I decided not to use PHP for this: the same server will host the PHP-based site, which will probably see a lot of traffic, and I'm afraid of tying up a FastCGI socket every hour for this operation (I don't control how often the script runs, only what it executes). I could also install whatever interpreter I want (scripting languages I already know: Perl, Python, PHP, maybe Lua). That, however, is a different question. This question assumes I want to use a Bash script.
The Problem
I have a master template file, which looks something like:
<!DOCTYPE html PUBLIC .... >
<html lang="en" ...>
<head> ... </head>
<body>
<div id=...></div>
<div id=...></div>
</body>
</html>
From this, I need to parse from the top of the document up to </head>, strip a few lines from that section and add one in, replace the <title> placeholder with the actual title, and save that to a file. Then I need to parse the <body> and first <div> out as a separate file, and then finally the second <div> (to which I also need to make some changes in the page footer). I will be discarding </body> and </html> since this template is actually part of a two-layer template (the replaced page title will use a Smarty variable to get its text).
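To make the intended split concrete, here is a minimal line-oriented sketch of it. It is not a real HTML parser. It assumes that </head>, <body> and each top-level <div> start at the beginning of their own line, and that nested divs are indented. The sample template, the output file names (head.tpl, first.tpl, second.tpl) and the Smarty variable name $page_title are all hypothetical, and sed -i assumes GNU sed.

```shell
#!/bin/bash
# Sketch only: relies on the layout assumptions described above.
set -eu

workdir=$(mktemp -d)
master="$workdir/master.html"   # hypothetical stand-in for the master template

cat > "$master" <<'EOF'
<!DOCTYPE html>
<html lang="en">
<head>
<title>PLACEHOLDER</title>
</head>
<body>
<div id="header"></div>
<div id="footer"></div>
</body>
</html>
EOF

# Send each line to one of three files, switching the target
# whenever a marker tag is seen at the start of a line.
awk -v dir="$workdir" '
    BEGIN { out = dir "/head.tpl"; divs = 0 }
    /^<\/body>/ || /^<\/html>/ { next }            # discard the closing tags
    /^<body/ { out = dir "/first.tpl" }            # <body> plus the first div
    /^<div / { divs++; if (divs == 2) out = dir "/second.tpl" }
    { print > out }
' "$master"

# Swap the title placeholder for a Smarty variable (name is hypothetical).
sed -i 's|<title>.*</title>|<title>{$page_title}</title>|' "$workdir/head.tpl"
```

This only works while the template keeps that one-tag-per-line layout. If the markup is reformatted, the line patterns would silently stop matching, which is the fragility the question is about.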
The Conundrum
The question is, is there an easier/better way to do this than regex? I know Bash provides the compound command [[ $htmlstring =~ regex ]] and the ${BASH_REMATCH} array to match and capture, but I also know that parsing HTML with regex is generally a bad idea.
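For reference, a small sketch of what the regex route looks like in Bash. The sample line and the Smarty variable name are hypothetical. The pattern is kept in a variable and used unquoted, because a quoted right-hand side of =~ is matched as a literal string in Bash 3.2 and later. The =~ operator only matches and captures, so the replacement here is done with parameter expansion.

```shell
#!/bin/bash
# Hypothetical sample line standing in for one line of the template.
line='<title>PLACEHOLDER</title>'

# POSIX extended regex, stored in a variable so it is not quoted at the =~.
re='<title>(.*)</title>'
new='{$page_title}'   # hypothetical Smarty variable

if [[ $line =~ $re ]]; then
    title=${BASH_REMATCH[1]}            # text captured by the first group
    replaced=${line/"$title"/$new}      # literal substitution of the capture
    printf '%s\n' "$replaced"
fi
```

This handles a single line at a time, so a tag that spans several lines would not be matched.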