
The code print split("foo:bar", a) prints the number of pieces that split() produces when it cuts the string on the field separator. Since the default field separator is a space and there is none in "foo:bar", the result is 1:

$ awk 'BEGIN{print split("foo:bar",a)}'
1

However, if the field separator is ":" then the result is obviously 2 ("foo" and "bar"):

$ awk 'BEGIN{FS=":"; print split("foo:bar", a)}'
2
$ awk -F: 'BEGIN{print split("foo:bar", a)}'
2

However, this is not the case if FS is set after the Awk program:

$ awk 'BEGIN{print split("foo:bar", a)}' FS=":"
1

If I call it not in the BEGIN block but while processing a file, FS is taken into account:

$ echo "bla" > file
$ awk '{print split("foo:bar",a)}' FS=":" file
2

So it looks like FS set before the expression is already taken into account in the BEGIN block, while it is not if defined after.

Why is this happening? I could not find details on this in GNU Awk User's Guide → 4.5.4 Setting FS from the Command Line. I am working on GNU Awk 5.

fedorqui
  • This seems to be related to what I found in my post: https://stackoverflow.com/questions/19075671/how-do-i-use-shell-variables-in-an-awk-script If a variable is defined after the code block, it will not be available in the BEGIN block. – Jotne Aug 21 '19 at 09:24
  • @Jotne interesting! kvantour's answer here will be a good reference on _why_ this is happening. Good to see you here again, by the way :) – fedorqui Aug 21 '19 at 11:13

2 Answers


Because you can set the variable individually for each file you process, and BEGIN happens before any of that.

bash$ awk '{ print NF }' <(echo "foo:bar") FS=: <(echo "foo:bar")
1
2
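
The same effect can be reproduced without bash process substitution. This is only a sketch for a plain POSIX shell, assuming `mktemp` is available; the file names `a.txt` and `b.txt` are made up for illustration:

```shell
# Hypothetical scratch files; any two files containing "foo:bar" would do.
tmpdir=$(mktemp -d)
printf 'foo:bar\n' > "$tmpdir/a.txt"
printf 'foo:bar\n' > "$tmpdir/b.txt"

# FS=: sits between the two file operands, so it is assigned after a.txt
# has been read and before b.txt is opened.
per_file_nf=$(awk '{ print NF }' "$tmpdir/a.txt" FS=: "$tmpdir/b.txt")
printf '%s\n' "$per_file_nf"

rm -rf "$tmpdir"
```

The first file is split with the default separator (one field), the second with ":" (two fields).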
tripleee
  • So `-F` happens before the `BEGIN` block and `FS=:` (on the right) happens before the `BEGINFILE` block but after the `BEGIN`? – fedorqui Aug 21 '19 at 08:10
  • Quite so, yes. Options are handled before `BEGIN`, non-option arguments after. – tripleee Aug 21 '19 at 08:13
  • @fedorqui exactly. The flag `-F` is processed by awk as an option, not as an argument stored in `ARGV`. Anything located in `ARGV` is processed after `BEGIN`. `BEGINFILE` has no effect on that. Awk processes each element of `ARGV` as follows: if it is a file, it runs `BEGINFILE`; if it is a variable assignment, it performs the assignment. As a consequence, an assignment takes effect before the `BEGINFILE` of the file that follows it. – kvantour Aug 21 '19 at 08:15
  • @kvantour aaah that's it! In [2.3 Other Command-Line Arguments](https://www.gnu.org/software/gawk/manual/gawk.html#Other-Arguments) _The distinction between file name arguments and variable-assignment arguments is made when awk is about to open the next input file. At that point in execution, it checks the file name to see whether it is really a variable assignment; if so, awk sets the variable instead of reading a file_ and then _the values of variables assigned in this fashion are not available inside a BEGIN rule_. – fedorqui Aug 21 '19 at 08:20
  • @fedorqui also be aware that this works : `awk -F: '1' /dev/null`, but this does not `awk '1' -F: /dev/null` – kvantour Aug 21 '19 at 08:22

This behaviour is not specific to GNU awk; it is specified by POSIX.

Calling convention:

The awk calling convention is the following:

awk [-F sepstring] [-v assignment]... program [argument...]
awk [-F sepstring] -f progfile [-f progfile]... [-v assignment]...
       [argument...]

This shows that any option (the flags -F, -v, -f) passed to awk must come before the program text and any arguments. For example:

# this works
$ awk -F: '1' /dev/null
# this fails
$ awk '1' -F: /dev/null
awk: fatal: cannot open file `-F:' for reading (No such file or directory)

Field separators and assignments as options:

The Standard states:

-F sepstring: Define the input field separator. This option shall be equivalent to: -v FS=sepstring

-v assignment: The application shall ensure that the assignment argument is in the same form as an assignment operand. The specified variable assignment shall occur prior to executing the awk program, including the actions associated with BEGIN patterns (if any). Multiple occurrences of this option can be specified.

source: POSIX awk standard

So, if you define a variable assignment or declare a field separator using the options, BEGIN will know them:

$ awk -F: -v a=1 'BEGIN{print FS,a}'
: 1
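
Applied to the `split()` call from the question, a small sketch (the shell variable names are only for illustration) shows that both option forms are already in effect inside BEGIN:

```shell
# -F: and -v FS=: are both options, so FS is ":" before BEGIN runs
# and split() inside BEGIN cuts "foo:bar" into two pieces.
with_F=$(awk -F: 'BEGIN { print split("foo:bar", a) }')
with_v=$(awk -v FS=: 'BEGIN { print split("foo:bar", a) }')
echo "$with_F $with_v"
```

Both commands print 2, matching the equivalence the standard describes.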

What are arguments?:

The Standard states:

argument: Either of the following two types of argument can be intermixed:

  • file: A pathname of a file that contains the input to be read, which is matched against the set of patterns in the program. If no file operands are specified, or if a file operand is '-', the standard input shall be used.
  • assignment: An <snip: extremely long sentence to state varname=varvalue>, shall specify a variable assignment rather than a pathname. <snip: some extended details on the meaning of varname=varvalue> Each such variable assignment shall occur just prior to the processing of the following file, if any. Thus, an assignment before the first file argument shall be executed after the BEGIN actions (if any), while an assignment after the last file argument shall occur before the END actions (if any). If there are no file arguments, assignments shall be executed before processing the standard input.

source: POSIX awk standard

Which means that if you do:

$ awk program FS=val file

BEGIN will not know about the new definition of FS but any other part of the program will.
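
A minimal sketch of that timeline, reusing the `split()` call from the question (the temporary file is an assumption made for the demo, and `mktemp` is assumed to be available):

```shell
# Hypothetical one-line input file so that the main rule runs exactly once.
tmpfile=$(mktemp)
printf 'foo:bar\n' > "$tmpfile"

# FS=: is an operand placed before the file: it is assigned after BEGIN
# and before the file is read, and it is still in effect in END.
fs_timeline=$(awk 'BEGIN { print "BEGIN", split("foo:bar", a) }
                         { print "MAIN", NF }
                   END   { print "END", split("foo:bar", a) }' FS=: "$tmpfile")
printf '%s\n' "$fs_timeline"

rm -f "$tmpfile"
```

BEGIN still sees the default separator and reports 1 piece, while the main rule and END both report 2.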

Example:

$ awk -v OFS="|" 'BEGIN{print "BEGIN",FS,a,""}END{print "END",FS,a,""}' FS=: a=1 /dev/null
BEGIN| ||
END|:|1|
$ awk -v OFS="|" 'BEGIN{print "BEGIN",FS,a,""}
                  {print "ACTION",FS,a,""}
                  END{print "END",FS,a,""}' FS=: a=1 <(echo 1) a=2
BEGIN| ||
ACTION|:|1|
END|:|2|
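
One way to see that such assignments are ordinary operands is to dump `ARGV` from BEGIN, before any of them has been processed. This is a sketch only; the loop starts at index 1 because the value of `ARGV[0]` can differ between awk implementations:

```shell
# The operands are still just strings in ARGV when BEGIN runs;
# none of the assignments has been applied yet.
argv_dump=$(awk 'BEGIN { for (i = 1; i < ARGC; i++) print i, ARGV[i] }' FS=: a=1 /dev/null a=2)
printf '%s\n' "$argv_dump"
```

The assignments and the file name appear in command-line order, which is the order in which awk later walks through them.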


kvantour
  • You've created a great piece of content, many thanks for it! Do you think it is worth indicating the reference to Other Command-Line arguments [I mentioned in tripleee's answer](https://stackoverflow.com/questions/57586974/why-is-field-separator-taken-into-account-differently-if-set-before-or-after-the/57587802#comment101632200_57587025)? – fedorqui Aug 21 '19 at 09:03
  • @fedorqui I'll include it as it mentions `ARGV` and `ARGC`. It looks like a longer version of my short comment under tripleee's answer. However, I disagree with the last sentence: _Given the variable assignment feature, the `-F` option for setting the value of FS is not strictly necessary. It remains for historical compatibility._ Using `-F` can be necessary when using `getline` in a `BEGIN` statement, but you can also define `FS` there. – kvantour Aug 21 '19 at 09:17
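
The `getline` point in the last comment can be sketched as follows. The `cmd | getline` form is used here as an assumption for the demo, because it reads from a command rather than from the main input:

```shell
# -F is an option: FS is ":" when getline splits the record inside BEGIN.
opt=$(awk -F: 'BEGIN { "echo foo:bar" | getline; print $1 }')

# FS=: is an operand: it has not been assigned yet when BEGIN runs,
# so the record is split with the default separator.
arg=$(awk 'BEGIN { "echo foo:bar" | getline; print $1 }' FS=:)

# Setting FS inside BEGIN before calling getline works as well.
inl=$(awk 'BEGIN { FS = ":"; "echo foo:bar" | getline; print $1 }')

printf '%s\n' "$opt" "$arg" "$inl"
```

The first and third commands print foo, while the second prints the whole record foo:bar.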