awk with multiline regex; output filename based on awk match

Question

I'm currently trying to extract 300-odd functions and subroutines from a 22kLoC file, and decided to try to do it programmatically (I did it by hand for the 'biggest' chunks).

Consider a file of the form

declare sub DoStatsTab12( byval shortlga as string)
declare sub DoStatsTab13( byval shortlga as string)
declare sub ZOMFGAnotherSub

Other lines that start with something other than "/^sub \w+/" or "/^end sub/"

sub main

    This is the first sub: it should be in the output file mainFunc.txt

end sub

sub test

    This is a second sub

    it has more lines than the first.

    It is supposed to go to testFunc.txt

end sub

Function ConvertFileName(ByVal sTheName As String) As String

    This is a function so I should not see it if I am awking subs

    But when I alter the awk to chunk out functions, it will go to ConvertFileNameFunc.txt    

End Function

sub InitialiseVars(a, b, c)

    This sub has some arguments - next step is to parse out its arguments
    Code code code;
    more code;
    ' maybe a comment, even? 


  and some code which is badly indented (original code was written by a guy who didn't believe in structure or documentation)

    and


  with an arbitrary number of newlines between bits of code because why not? 


    So anyhow - the output of awk should be everything from sub InitialiseVars to end sub, and should go into InitialiseVarsFunc.txt

end sub

The gist: find sets of lines that begin with ^sub [subName](subArgs) and end with ^end sub

And then (and here's the bit that eludes me): save the extracted subroutine to a file named [subName]Func.txt

awk suggested itself as a candidate (I have written text-extraction regex queries in PHP in the past using preg_match(), but I don't want to count on having WAMP/LAMP availability).

My starting point is the delightfully-parsimonious (double-quotes because Windows)

awk "/^sub/,/^end sub/" fName

This finds the relevant chunks (and prints them to stdout).

The step of putting the output to a file, and naming the file after $2 of the awk capture, is beyond me.

An earlier stage of this process involved awk-ing the subroutine names and storing them: that was easy, since each sub is declared by a one-liner of the form

declare sub [subName](subArgs)

So this does that, and does it perfectly -

awk "match($0, /declare sub (\w+)/)
{print substr($3, RSTART, index($3, \"(\")>0 ? index($3, \"(\")-1: RLENGTH)
     > substr($3, RSTART, index($3, \"(\")>0 ? index($3, \"(\")-1: RLENGTH)\".txt\"}"
fName

(I've tried to present it so that it's easy to see that the output filename and $3 of the awk - parsed up to the first '(' if any - are the same thing).

It seems to me that if the output of

awk '/^sub/,/^end sub/' fName

was concatenated into one array, then $2 (appropriately truncated at '(' ) would work. But it didn't.

I have looked at various SO (and other SE-family) threads that deal with multiline awk - e.g., this one and this one, but none have given me enough of a heads-up on my problem (they help with getting the match itself, but not with piping it to a file named after itself).

I have RTFD for awk (and grep), also to no avail.

Never use `'/start/,/end/'` syntax as it makes trivial jobs very slightly briefer but then even slightly more complicated jobs require a complete rewrite. Always use `/start/{f=1} f; /end/{f=0}` instead. — Ed Morton, Jan 01 '15 at 22:09
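To make the comment above concrete, here is a minimal sketch comparing the range pattern with the flag idiom. It assumes a POSIX shell (so single quotes work) and a made-up `sample.bas`; the file names `range.txt`, `flag.txt` and `body.txt` are only for illustration.

```shell
# Hypothetical sample input, much smaller than the real file
cat > sample.bas <<'EOF'
declare sub main
sub main
    body line
end sub
other text
EOF

# Range pattern: prints each block, delimiters included
awk '/^sub/,/^end sub/' sample.bas > range.txt

# Flag idiom: same output, but each step is now a separate rule
awk '/^sub/{f=1} f; /^end sub/{f=0}' sample.bas > flag.txt

# Reordering the rules drops the "sub" and "end sub" lines themselves,
# which the range pattern cannot do without a rewrite
awk '/^end sub/{f=0} f; /^sub/{f=1}' sample.bas > body.txt
```

Because the start rule is an ordinary action block, it is also the natural place to compute an output file name, which is what the answers below do.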

Wintermute · Accepted Answer · 2015-01-01T05:12:34.257

I suggest

awk -F '[ (]*' '            # Field separator is space or open paren (for
                            # parameter lists). * because there may be multiple
                            # spaces, and parens only appear after the stuff we
                            # want to extract.
  BEGIN { IGNORECASE = 1 }  # case-insensitive pattern matching is probably
                            # a good idea because Basic is case-insensitive.
                            # (IGNORECASE is a gawk extension.)
  /^sub/ {                  # if the current line begins with "sub"
    outfile = $2 "Func.txt" # set the output file name
    flag = 1                # and the flag to know that output should happen
  }
  flag == 1 {               # if the flag is set
    print > outfile         # print the line to the outfile
  }
  /^end sub/ {              # when the sub ends, 
    flag = 0                # unset the flag
  }
' foo.bas

Note that parsing source code with simple pattern matching tools is error-prone because programming languages are, as a rule, not regular languages (with a few exceptions along the lines of Brainfuck). This sort of thing always depends on the formatting of the code.

If, for example, somewhere in the code a sub declaration is broken up into two lines (this is possible with _, I believe, although Basic is not something I do every day), trying to extract the name of a sub from the first line of its definition is futile. Formatting may also make minor adjustments of the patterns necessary; things like superfluous spaces at the beginning of a line would require handling. Use this stuff strictly for one-off code transformations and verify that it produced the desired result; don't be tempted to make it a part of a regular workflow.
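As an illustration of the "minor adjustments" mentioned above, here is a sketch that tolerates leading whitespace and mixed case without relying on gawk's IGNORECASE. It also closes each output file once its sub ends, which may matter with 300-odd subs on awks that limit the number of open files. The sample input and the variable names (`line`, `name`, `inside`) are made up for the example, and declarations continued with `_` are still not handled.

```shell
# Hypothetical sample input with indentation and mixed case
cat > sample.bas <<'EOF'
declare sub Main
  Sub Main
    print "hi"
  End Sub
sub InitialiseVars(a, b, c)
    code
end sub
EOF

awk '
  {
    line = tolower($0)                 # Basic is case-insensitive
    sub(/^[ \t]+/, "", line)           # ignore leading indentation
  }
  (!inside) && (line ~ /^sub[ \t]+[a-z_]/) {
    name = $0
    sub(/^[ \t]*[Ss][Uu][Bb][ \t]+/, "", name)   # drop the keyword
    sub(/[^A-Za-z0-9_].*$/, "", name)            # keep only the identifier
    outfile = name "Func.txt"
    inside = 1
  }
  inside { print > outfile }           # copy the original line unchanged
  inside && (line ~ /^end[ \t]+sub/) {
    close(outfile)                     # release the file handle
    inside = 0
  }
' sample.bas
```

A quick sanity check afterwards is to compare the number of `*Func.txt` files written with the number of `declare sub` lines in the source.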

This is the right approach and since the OP is on Windows, put the script in a file and execute as `awk -f script ...` instead of trying to deal with Windows nightmare quoting rules. — Ed Morton, Jan 01 '15 at 22:06
As you surmise, this is very definitely a one-off (I had previously tidied up the code for leading spaces in sub/function declarations). Interestingly, I could not get this to work from the command line (after kludging around Windoze's failure to 'do' single quotes - by (1) escaping double quotes; (2) replacing single-quotes with double-quotes): awk throws a syntax error (I have stared at my version and can't see it). Jidder's alternative (below) did work (again, after adapting for Windoze). — GT., Jan 04 '15 at 21:14
Me again... the problem at the command prompt was that I needed a semi-colon between `outfile = $2\"Func.txt\"` and `flag = 1`; when that was done it worked exactly as desired. — GT., Jan 04 '15 at 23:20
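Following the suggestion in the comments to keep the program in a file, here is a sketch of what that could look like. The file name `extract.awk` and the `kind` variable are assumptions made for this example, not part of the answer above. With the program in a file, the command line contains no quotes at all, so the same invocation should work from a Windows prompt. Passing `kind=function` covers the Function blocks mentioned in the question. The sketch assumes declarations start in column one and that the awk in use strips carriage returns, or that the file has LF line endings.

```shell
# Write the awk program to a file (on Windows, create it in an editor instead)
cat > extract.awk <<'EOF'
# extract.awk (hypothetical name): run with -v kind=sub or -v kind=function
BEGIN {
    FS = "[ (]+"                      # split on spaces and opening parens
    if (kind == "") kind = "sub"      # default when -v kind=... is omitted
}
tolower($1) == kind {                 # "sub name(...)" or "function name(...)"
    outfile = $2 "Func.txt"
    inside = 1
}
inside { print > outfile }
inside && tolower($1) == "end" && tolower($2) == kind {
    close(outfile)
    inside = 0
}
EOF

# Hypothetical sample input
cat > sample.bas <<'EOF'
declare sub test
sub test
    This is a second sub
end sub
Function ConvertFileName(ByVal sTheName As String) As String
    body
End Function
EOF

awk -v kind=sub -f extract.awk sample.bas
awk -v kind=function -f extract.awk sample.bas
```

Comparing `$1` with the keyword, rather than matching `/^sub/`, also avoids triggering on lines that merely start with the letters "sub".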

score 1 · Answer 2 · answered Jan 02 '15 at 11:40

Another awk way

awk -F'[ (]' 'x+=(/^sub/&&file=$2"Func.txt"){print > file}/^end sub/{x=file=""}' file

Explanation

awk -F'[ (]'                   - Set field separator to a space or an opening
                                 parenthesis

x+=(/^sub/&&file=$2"Func.txt") - Sets x to 1 if line begins with sub and sets file 
                                 to the second field + "Func.txt". As this is a 
                                 condition that is checking if x is true then the 
                                 next block will repeatedly be executed until x 
                                 is unset.

{print > file}                 - Whilst x is true print the line into the set filename


/^end sub/{x=file=""}          - If line begins with end sub then set both x and file 
                                 to nothing.
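One detail of the field separator is worth checking against your own file. The short sketch below, using a made-up input line, shows that `[ (]` treats every single space as a separator, so a declaration with extra spaces leaves `$2` empty, whereas `[ (]+` collapses runs of separators. The sample in the question uses single spaces, so the one-liner works there as written.

```shell
# Hypothetical declaration with several spaces after the keyword
line='sub   InitialiseVars (a, b)'

single=$(printf '%s\n' "$line" | awk -F'[ (]'  '{ print "[" $2 "]" }')
multi=$(printf '%s\n' "$line" | awk -F'[ (]+' '{ print "[" $2 "]" }')

echo "single-character separator: $single"   # prints []
echo "one-or-more separator:      $multi"    # prints [InitialiseVars]
```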
