Be aware that, according to POSIX, RS is defined as a single character and not a regular expression.
The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
source: Awk Posix standard
This implies that, under POSIX, RS=" *, *" leads to undefined behaviour, because RS then contains more than one character.
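The two cases that POSIX does specify can be sketched as follows. This is a minimal illustration, not taken from the original question: the function names and the sample input are made up for the example, and `awk` is assumed to be any POSIX-conforming implementation.

```shell
# Single-character RS: every "," ends a record.
# (No trailing newline in the input, so the last record is just "c".)
split_on_comma() {
    printf 'a,b,c' | awk 'BEGIN { RS = "," } { print NR ": " $0 }'
}

# Null RS (paragraph mode): blank lines separate records, leading and
# trailing blank lines are ignored, and <newline> always separates fields.
paragraph_mode() {
    printf '\n\nalpha 1\nalpha 2\n\n\nbeta 1\n\n' |
        awk 'BEGIN { RS = "" } { print NR ": " $1 " (" NF " fields)" }'
}

split_on_comma
paragraph_mode
```

Anything beyond these two forms, such as a multi-character RS, is left to the implementation.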
Other versions of awk, which implement extensions to POSIX, might take a different approach to what RS stands for. Examples are GNU awk and mawk. Both treat RS as a regular expression, but the two implementations differ slightly. The summary with respect to the usage of <asterisk> is:
| RS | awk (posix) | gawk | mawk |
|------+--------------+------------------+------------------|
| "*" | "<asterisk>" | "<asterisk>" | "<asterisk>" |
| "*c" | undefined | "<asterisk>c" | undefined |
| "c*" | undefined | "","c","ccc",... | "","c","ccc",... |
c is any character
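The first row of the table, where a lone asterisk is taken literally, can be checked with a short sketch. The function name and input are hypothetical and chosen only for illustration.

```shell
# RS is the single character "*": it is used literally, not as a
# regex quantifier, so the input splits at each asterisk.
split_on_asterisk() {
    printf 'a*b*c' | awk 'BEGIN { RS = "*" } { print }'
}

split_on_asterisk
```

The other two rows are deliberately not exercised here, since their outcome depends on which awk is installed.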
The above should explain the OP's error: RS="*, *" starts with an <asterisk> and is therefore an invalid regular expression according to mawk.
$ echo "abc" | ./mawk '/*c/'
mawk: line 1: regular expression compile failed (missing operand)
GNU awk: The manual of GNU awk states the following:
When using gawk, the value of RS is not limited to a one-character string. It can be any regular expression (see Regexp). (c.e.) In general, each record ends at the next string that matches the regular expression; the next record starts at the end of the matching string.
source: GNU awk manual
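As a rough sketch of RS as a regular expression, the following splits on a comma surrounded by optional spaces. It assumes an awk that implements this extension (gawk or mawk, for instance); under a strictly POSIX awk the result is unspecified. The function name and input are made up for the example.

```shell
# RS = " *, *": zero or more spaces, a comma, zero or more spaces.
# Each record ends where this pattern matches.
split_regex_rs() {
    printf 'a , b,  c' | awk 'BEGIN { RS = " *, *" } { print NR ": " $0 }'
}

split_regex_rs
```

Note that this pattern does not start with an asterisk, so it avoids the problematic case discussed in this answer.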
To understand the usage of <asterisk> in the regular expression in GNU awk, we find:
<asterisk> *
This symbol means that the preceding regular expression should be repeated as many times as necessary to find a match. For example, ph* applies the * symbol to the preceding h and looks for matches of one p followed by any number of hs. This also matches just p if no hs are present.
There are two subtle points to understand how * works. First, the * applies only to the single preceding regular expression component (e.g., in ph*, it applies just to the h). To cause * to apply to a larger subexpression, use parentheses: (ph)* matches ph, phph, phphph, and so on.
Second, * finds as many repetitions as possible. If the text to be matched is phhhhhhhhhhhhhhooey, ph* matches all of the hs.
source: GNU Regular expression operators
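The behaviour described in the quotation can be observed with the standard match() function, which sets RSTART and RLENGTH. This is a small sketch; the helper name show_match and the shortened test strings are illustrative choices, not part of the manual.

```shell
# Print where a regex matches in a string, its length, and the matched text.
# $1 = string, $2 = regex (passed as a dynamic regex string).
show_match() {
    awk -v s="$1" -v re="$2" 'BEGIN {
        if (match(s, re))
            print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
        else
            print "no match"
    }'
}

show_match 'phhhooey'   'ph*'     # * is greedy: takes every h
show_match 'pooey'      'ph*'     # zero repetitions: just the p
show_match 'phphphooey' '(ph)*'   # parentheses extend the scope of *
```

In all three calls the asterisk follows an operand, so the patterns are well defined in POSIX awk, gawk and mawk alike.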
It must be mentioned, however, that:
In POSIX awk and gawk, the *, + and ? operators stand for themselves when there is nothing in the regexp that precedes them. For example, /+/ matches a literal plus sign. However, many other versions of awk treat such a usage as a syntax error.
source: GNU Regular expression operators
Thus, in gawk, setting RS="*, *" implies that it would match a literal asterisk, a comma, and any number of trailing spaces, i.e. the strings "*,", "*, ", "*,  ", ...
$ echo "a,b, c" | awk 'BEGIN{RS="*, *"}1'
a,b, c
$ echo "a*,b, c" | awk 'BEGIN{RS="*, *"}1'
a
b, c
mawk: The manual of mawk states the following:
12. Multi-line records
Since mawk interprets RS as a regular expression, multi-line records are easy.
source: man mawk
but
11. Splitting strings, records and files
Awk programs use the same algorithm to split strings into arrays with split(), and records into fields on FS. mawk uses essentially the same algorithm to split files into records on RS.
Split(expr,A,sep) works as follows:
- <snip>
- If sep = " " (a single space), then <SPACE> is trimmed from the front and back of expr, and sep becomes <SPACE>. mawk defines <SPACE> as the regular expression /[ \t\n]+/. Otherwise sep is treated as a regular expression, except that meta-characters are ignored for a string of length 1, e.g., split(x, A, "*") and split(x, A, /\*/) are the same.
- <snip>
source: man mawk
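The equivalence stated in the manual excerpt for a one-character separator can be checked with a short sketch. The function name and the sample string are hypothetical.

```shell
# A separator string of length 1 is taken literally, so "*" behaves
# like the escaped regex /\*/. Prints both element counts and the
# second element of each resulting array.
compare_split() {
    awk 'BEGIN {
        x = "a*b*c"
        n = split(x, A, "*")     # one-character string separator
        m = split(x, B, /\*/)    # explicit regex for a literal asterisk
        print n, m, A[2], B[2]
    }'
}

compare_split
```

This only covers the length-1 case; it says nothing about longer separators that begin with a meta-character.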
The manual makes no mention of how a regular expression starting with a meta-character should be interpreted (e.g. "*c").
Note: in the GNU awk section, the mention of POSIX awk in the quoted manual text should be read as struck through, because, according to POSIX, a regular expression of the form "*, " leads to undefined behaviour. (This holds regardless of how RS is defined, since RS is not an ERE in POSIX awk anyway.)
The awk utility shall make use of the extended regular expression notation (see XBD Extended Regular Expressions)
source: Awk Posix standard
and
*+?{
The <asterisk>, <plus-sign>, <question-mark>, and <left-brace> shall be special except when used in a bracket expression (see RE Bracket Expression). Any of the following uses produce undefined results:
- If these characters appear first in an ERE, or immediately following an unescaped <vertical-line>, <circumflex>, <dollar-sign>, or <left-parenthesis>
- If a <left-brace> is not part of a valid interval expression (see EREs Matching Multiple Characters)
source: POSIX Extended Regular Expressions