Parse HTML doc in Bash script without regex

Question

I've been tasked with programmatically splitting up an HTML master template into subtemplates, performing some minor modifications to each template, and saving the resultant files.

Why I'm using Bash (you can skip this to get to the question)

(edited) The choice to use Bash is fairly arbitrary. I know the server runs Red Hat 5.5, so I'm writing a Bash script. The server does have a PHP interpreter, but I decided not to use PHP for this: the same server will host the site, which uses PHP and will probably see a lot of traffic, so I'm afraid of tying up a FastCGI socket every hour for this operation (I don't control how often the script runs, just what it executes). I can also install whatever interpreter I want (scripting languages I already know: Perl, Python, PHP, maybe Lua). That, however, is a different question. This question assumes I want to use a Bash script.

The Problem

I have a master template file, which looks something like:

<!DOCTYPE html PUBLIC .... >
<html lang="en" ...>
<head> ... </head>
<body>
    <div id=...></div>
    <div id=...></div>
</body>
</html>

From this, I need to parse from the top of the document up to </head>, strip a few lines from that section and add one in, replace the <title> placeholder with the actual title, and save that to a file. Then I need to parse the <body> and first <div> out as a separate file, and then finally the second <div> (to which I also need to make some changes in the page footer). I will be discarding </body> and </html> since this template is actually part of a two-layer template (the replaced page title will use a Smarty variable to get its text).

The Conundrum

The question is: is there an easier or better way to do this than regex? I know Bash provides the compound command [[ $htmlstring =~ regex ]] and the ${BASH_REMATCH} array to match and capture text, but I also know that parsing HTML with regex is generally a bad idea.
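
For reference, here is a minimal sketch of how that compound command is normally written. The sample line and the {$page_title} Smarty variable name are made up for illustration. Note that BASH_REMATCH only holds what was matched, so any replacement is a separate step:

```shell
#!/usr/bin/env bash
# Sketch only: the sample line and the Smarty variable name are hypothetical.
line='<title>PLACEHOLDER</title>'

# Keep the pattern in a variable and leave it unquoted on the right of =~;
# a quoted pattern is compared as a literal string in Bash 3.2 and later.
re='<title>(.*)</title>'

captured=''
replaced=$line
if [[ $line =~ $re ]]; then
    captured=${BASH_REMATCH[1]}      # text matched by the first parenthesised group
    new='{$page_title}'              # replacement text, kept literal by the single quotes
    # Replace the first literal occurrence of the captured text.
    replaced=${line/"$captured"/$new}
fi

printf '%s\n' "$captured" "$replaced"
```

The regex here only finds the placeholder on a single known line; it does not try to parse the document structure.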

I would double check to see whether Perl is already installed on your target server(s). It probably is (especially if PHP is already there too), and your code is going to be way easier to do in Perl. Also, you can run PHP scripts as shell scripts without going through a web server. — Greg Hewgill, Feb 23 '12 at 00:26
I'm not terribly efficient in Perl (only written 1 or 2 scripts in it), but I suppose I would use `WWW::Mechanize` in this case? — , Feb 23 '12 at 00:31
No, from your description it just looks like you're processing some text (master template file) and generating some more text (changed template file). No need to use `WWW::Mechanize`. — Greg Hewgill, Feb 23 '12 at 00:37
Most of the HTML is straight copy-pasting though, so I really need a library to help me target particular sections. This is why regex seems ideal: I can say "capture everything from the start of the doc until `</head>`" in one if statement. Are you saying it'd be more performant or more reliable to do line-by-line text parsing using string comparisons? — , Feb 23 '12 at 00:45
After checking, I was advised I can install anything I want onto the server. But since the other guys want to be able to debug it, I think I'm restricted to either PHP or a shell script anyway. So I'll go with PHP. — , Feb 23 '12 at 05:53

score 1 · Accepted Answer · answered Feb 23 '12 at 00:35

If the HTML file you are parsing has a known, fixed structure, you can use awk for this. It's not very hard to write a program that keeps its state in a variable (e.g., waiting for header, parsing header, waiting for body) and does different things depending on that state as it reads the file. awk also supports regular expressions, and you can put everything into a well-structured file.
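
A minimal sketch of the state-machine idea, assuming the template has the fixed layout shown in the question, with each tag of interest starting on its own line. The file names, the sample template, and the {$page_title} Smarty variable are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch only: sample template, output file names and Smarty variable are made up.
workdir=$(mktemp -d)

cat > "$workdir/master.html" <<'EOF'
<!DOCTYPE html>
<html lang="en">
<head>
<title>PLACEHOLDER</title>
</head>
<body>
    <div id="first"></div>
    <div id="second"></div>
</body>
</html>
EOF

awk -v dir="$workdir" '
    BEGIN { state = "head" }

    # </body> and everything after it is discarded
    /<\/body>/ { state = "done" }

    # the second <div> starts the third file
    state == "body1" && /<div/ {
        if (++divs == 2) state = "body2"
    }

    state == "head" {
        # swap the title placeholder for a Smarty variable
        sub(/<title>.*<\/title>/, "<title>{$page_title}</title>")
        print > (dir "/head.tpl")
    }
    state == "body1" { print > (dir "/body1.tpl") }
    state == "body2" { print > (dir "/body2.tpl") }

    # change state only after the </head> line itself has been written
    state == "head" && /<\/head>/ { state = "body1" }
' "$workdir/master.html"

ls "$workdir"
```

This relies entirely on the line layout being stable. If a <div> is nested inside the first one, or two tags share a line, the states would need to be adjusted.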

Adiel Mittmann

This is a little helpful, but it'll blow out my line count something fierce I think. Of particular concern is the fact that I would still be using regex. My question was whether there is a better way to do it than regex: you seem to be saying "no, but use awk to do the regex". – Feb 23 '12 at 00:42
Indeed, you would be still using regex, although if you're not going to use a parser, regular expressions are the next "best thing". I just added this answer anyway because I think writing a script in awk for this case would be easier than plain bash. – Adiel Mittmann Feb 23 '12 at 01:02

score 1 · Answer 2 · answered Feb 23 '12 at 02:22

You can use the -H (HTML) option of the xmlstarlet command to manipulate an HTML file.

For example:

# content of template file
$ cat template.html
<!DOCTYPE html >
<html lang="en">
    <head> ... </head>
    <body>
        <div id="div1"></div>
        <div id="div2"></div>
    </body>
</html>

# update the head tag
$ xmlstarlet ed -H -u '//head' -v 'hello, world' template.html
<?xml version="1.0"?>
<!DOCTYPE html>
<html lang="en">
  <head>hello, world</head>
  <body>
    <div id="div1"/>
    <div id="div2"/>
  </body>
</html>

Is that using XPath? Also, we don't have `xmlstarlet` on the servers I don't think. Doesn't sound like a standard RHEL package. — , Feb 23 '12 at 05:44

score 0 · Answer 3 · answered Feb 23 '12 at 05:54

Okay, so I'm going with PHP and I'll use standard string manipulation. I should be able to make good use of explode to do this sort of thing. Thanks all.
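
For comparison with the Bash approach the question originally asked about: the split-on-a-literal-marker idea behind explode can also be sketched in Bash with parameter expansion, with no regex involved. This is not the PHP code itself, and the sample template and variable names below are made up:

```shell
#!/usr/bin/env bash
# Sketch only: the template string and all variable names are hypothetical.
template='<html>
<head><title>T</title></head>
<body>
<div id="a"></div>
<div id="b"></div>
</body>
</html>'

head_end='</head>'
body_end='</body>'
div_open='<div'

# Part 1: everything up to and including the first </head>.
head_part=${template%%"$head_end"*}$head_end

# Remainder after </head>, with </body> and </html> dropped.
rest=${template#*"$head_end"}
rest=${rest%%"$body_end"*}

# Split the remainder at the second "<div".
after_first=${rest#*"$div_open"}
body1=${rest%%"$div_open"*}$div_open${after_first%%"$div_open"*}
body2=$div_open${after_first#*"$div_open"}

printf '%s\n---\n%s\n---\n%s\n' "$head_part" "$body1" "$body2"
```

The quoted markers are matched as literal text, so characters such as < and / need no escaping. As with any plain string splitting, it assumes the markers appear exactly once or in a known order.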
