Using awk to process html-related Gift-format Moodle questions

Question

This is basically an awk question, but it is about processing data for the Moodle Gift format, thus the tags.

I want to show HTML code in a question (Moodle "test" activity), but I need to replace < and > with the corresponding entities; otherwise they are interpreted as "real" HTML and not printed. However, I want to be able to type the question with regular code and post-process the file before importing it as gift into Moodle.

I thought awk would be the perfect tool to do this.

Say I have this (invalid as such) Moodle/gift question:

::q1::[html]This is a question about HTML:
<pre>
<p>some text</p>
</pre>
and some tag:<code><img></code>
{T}

What I want is a script that translates this into a valid gift question:

::q1::[html]This is a question about HTML:
<pre>
&lt;p&gt;some text&lt;/p&gt;
</pre>
and some tag:<code>&lt;img&gt;</code>
{T}

Key point: replace < and > with &lt; and &gt; when they appear:

- inside a <pre> ... </pre> block (assuming those tags are alone on a line)
- between <code> and </code>, with an arbitrary string in between.

For the first part, I'm fine. I have a shell script calling awk (gawk, actually).

awk -f process_src2gift.awk $1.src >$1.gift

with process_src2gift.awk:

BEGIN { print "// THIS IS A GENERATED FILE !" }
{
    if( $1=="<pre>" ) # opening a "code" block
    {
        code=1;
        print $0;
    }
    else
    {
        if( $1=="</pre>" ) # closing a "code" block
        {
            code=0;
            print $0;
        }
        else
        { # if "code block", replace < > by html entities
            if( code==1 )
            {
                gsub(">","\\&gt;");
                gsub("<","\\&lt;");
            }
            print $0;
        }
    }
}
END { print "// END" }

However, I'm stuck with the second requirement.

Questions:

- Is it possible to extend my awk script to process the HTML code inside the <code> tags? Any ideas? I thought about using sed but I didn't see how to do that.
- Maybe awk isn't the right tool for that? I'm open to suggestions for other (standard Linux) tools.

Any hint on why the downvote ? What [point](https://stackoverflow.com/help/asking) did I miss ? — kebs, Dec 01 '19 at 20:37
didn't downvote. "They" probably did, because processing html inside awk is considered a non-starter ;-) . Usually what happens is you can solve this problem, and so you continue on until you get to a problem that is not reg-ex based and that can't be solved without standing on your head (in awk) AND then you need to learn html aware processes in a big rush. (Sorry I can't recommend a replacement tool). I don't see any reason that you can't extend the code you have to manage another flag var `code2`? to perform the same substitution. Good luck. — shellter, Dec 01 '19 at 21:06
@shellter Thanks for the info. I understand processing html with awk is full of pitfalls but this is really a "side" case, Its not about processing a whole html page. — kebs, Dec 01 '19 at 21:22

kebs · Accepted Answer · 2019-12-02T15:12:05.067

Answering my own question.

I found a solution using a two-step awk process:

- first step: as described in the question
- second step: define <code> or </code> as the field delimiter, using a regex, and apply the string replacement to the second field ($2).

The shell file becomes:

echo "Step 1"
awk -f process_src2gift.awk $1.src >$1.tmp

echo "Step 2"
awk -f process_src2gift_2.awk $1.tmp >$1.gift

rm $1.tmp

And the second awk file (process_src2gift_2.awk) will be:

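# Split each line on <code> or </code>, so that $2 is the text between the tags.
BEGIN { FS="[<][/]?[c][o][d][e][>]"; }
{
    if( NF >= 3 )
    {
        # Escape only when the line holds a <code>...</code> pair: assigning
        # to $2 makes awk rebuild $0 with OFS, which would drop the tags.
        gsub(">","\\&gt;",$2);
        gsub("<","\\&lt;",$2);
        print $1 "<code>" $2 "</code>" $3
    }
    else
        print $0
}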
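
To see what the regex field separator does to a line, here is a small sketch. The helper name show_fields is made up for illustration, and the shorter regex `</?code>` is used on the assumption that it matches the same two strings as the bracketed form above.

```shell
#!/bin/sh
# Hypothetical helper: print each field produced by splitting on
# <code> or </code>, one per line, with its field number.
show_fields() {
    awk -F '</?code>' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }'
}

printf '%s\n' 'and some tag:<code><img></code>' | show_fields
# $1=[and some tag:]
# $2=[<img>]
# $3=[]
```

The text before the opening tag lands in $1, the code in $2, and whatever follows the closing tag in $3 (empty here), which is why the script tests for NF >= 3.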

Of course, there are limitations:

- the <code> tag must not carry attributes
- only one <code></code> pair per line is handled
- probably others...
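
The "one pair per line" limitation can probably be lifted, and the two steps merged into one, by scanning each line with index() and substr() instead of relying on field splitting. The following is a minimal sketch under those assumptions, not the script used above: the names src2gift and esc are hypothetical, the generated-file banner lines are left out, and <code> tags with attributes are still not handled.

```shell
#!/bin/sh
# Hypothetical single-pass filter: reads the .src text on stdin and
# writes the escaped gift text on stdout.
src2gift() {
    awk '
    # Return s with > and < replaced by their HTML entities.
    function esc(s) {
        gsub(/>/, "\\&gt;", s)
        gsub(/</, "\\&lt;", s)
        return s
    }
    # <pre> and </pre> alone on a line toggle the block flag.
    $1 == "<pre>"  { inpre = 1; print; next }
    $1 == "</pre>" { inpre = 0; print; next }
    # Inside a <pre> block the whole line is escaped.
    inpre          { print esc($0); next }
    {
        out = ""
        rest = $0
        # Consume one <code>...</code> pair per iteration.
        while ((start = index(rest, "<code>")) > 0) {
            tail = substr(rest, start + 6)      # text after <code>
            len = index(tail, "</code>")
            if (len == 0) break                 # unclosed tag: keep the remainder as is
            out = out substr(rest, 1, start + 5) esc(substr(tail, 1, len - 1)) "</code>"
            rest = substr(tail, len + 7)        # text after </code>
        }
        print out rest
    }'
}
```

With this in place, the two awk calls and the temporary file in the shell file could be replaced by something like `src2gift <"$1.src" >"$1.gift"`.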
