
I'm learning the awk programming language and I'm stuck on a problem.

I have a file (awk.dat) with the following content:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Maecenas pellentesque erat vel tortor consectetur condimentum.
Nunc enim orci, euismod id nisi eget, interdum cursus ex.
Curabitur a dapibus tellus.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Aliquam interdum mauris volutpat nisl placerat, et facilisis.

I'm using the command below:

awk 'BEGIN{RS="*, *";ORS="<<<---\n"} {print $0}' awk.dat

It returns this error:

awk: run time error: regular expression compile failed (missing operand)
*, *
    FILENAME="" FNR=0 NR=0

However, if I use the command awk 'BEGIN{RS=" *, *";ORS="<<<---\n"} {print $0}' awk.dat, it gives me the required result.

I need to understand this part: RS=" *, *". What is the meaning of the space between the opening double quote and the * before the comma, whose absence causes the error?

Expected Output:

Lorem ipsum dolor sit amet<<<---
consectetur adipiscing elit.
Maecenas pellentesque erat vel tortor consectetur condimentum.
Nunc enim orci<<<---
euismod id nisi eget<<<---
interdum cursus ex.
Curabitur a dapibus tellus.
Lorem ipsum dolor sit amet<<<---
consectetur adipiscing elit.
Aliquam interdum mauris volutpat nisl placerat<<<---
et facilisis.
<<<---

Thanks.

User123
A.K
  • What do you mean by `*, *`? `awk` parses it as a regex, and in regex syntax `*` is a quantifier that specifies how many times the previous token should be matched. In your first try there is no previous token for the first `*`, while in the second it is a space character, which makes the regex valid. The second variation is probably what you're looking for if you want to use a comma surrounded by any amount of spaces as RS. – Aaron Dec 04 '18 at 16:30
  • What version of awk are you using? I believe it is mawk. Can you confirm this? – kvantour Dec 04 '18 at 21:58
  • Yup, it's `mawk`. – A.K Dec 08 '18 at 09:54

3 Answers

"[space1]*,[space2]*"

is a regex. It matches a string consisting of:

zero or more spaces (space1), followed by a comma, followed by zero or more spaces (space2)

The first one, "*,[space]*", is wrong because * has a special meaning in regex: it repeats the preceding character or group zero or more times. With nothing before it, there is nothing to repeat, so you cannot put it at the very beginning.
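
As a quick check of the "zero or more" reading, here is a minimal sketch that uses split() with the same regex instead of RS, so it does not depend on an awk that accepts a regex in RS. The sample string and the names out and parts are made up for illustration.

```shell
# Split a sample string on " *, *": a comma with zero or more spaces on each side.
out=$(awk 'BEGIN {
    n = split("a,b , c", parts, / *, */)
    # Brackets make any leftover spaces around a piece visible.
    for (i = 1; i <= n; i++) print "[" parts[i] "]"
}')
printf '%s\n' "$out"
```

The comma between a and b has no spaces around it and still counts as a separator, because zero is a valid number of repetitions. The spaces around the second comma are consumed as part of the separator, so a, b and c each come out without surrounding spaces.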

Kent

Be aware that, according to POSIX, RS is defined as a single character and not a regular expression.

The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.

source: Awk Posix standard

This implies that the behaviour of RS=" *, *" is unspecified by POSIX.

Other versions of awk, which implement extensions to POSIX, might take a different approach to what RS stands for. Examples are GNU awk and mawk. Both treat RS as a regular expression, but the two implementations differ slightly. The summary with respect to the usage of <asterisk> is:

| RS   | awk (posix)  | gawk             | mawk             |
|------+--------------+------------------+------------------|
| "*"  | "<asterisk>" | "<asterisk>"     | "<asterisk>"     |
| "*c" | undefined    | "<asterisk>c"    | undefined        |
| "c*" | undefined    | "","c","ccc",... | "","c","ccc",... |

c is any character

The above should explain the error of the OP as RS="*, *" is an invalid regular expression according to mawk.

$ echo "abc" | ./mawk '/*c/'
mawk: line 1: regular expression compile failed (missing operand)

GNU awk: The manual of GNU awk states the following:

When using gawk, the value of RS is not limited to a one-character string. It can be any regular expression (see Regexp). (c.e.) In general, each record ends at the next string that matches the regular expression; the next record starts at the end of the matching string.

source: GNU awk manual

To understand the usage of <asterisk> in the regular expression in GNU awk, we find:

<asterisk> * This symbol means that the preceding regular expression should be repeated as many times as necessary to find a match. For example, ph* applies the * symbol to the preceding h and looks for matches of one p followed by any number of hs. This also matches just p if no hs are present.

There are two subtle points to understand how * works. First, the * applies only to the single preceding regular expression component (e.g., in ph*, it applies just to the h). To cause * to apply to a larger subexpression, use parentheses: (ph)* matches ph, phph, phphph, and so on.

Second, * finds as many repetitions as possible. If the text to be matched is phhhhhhhhhhhhhhooey, ph* matches all of the hs.

source: GNU Regular expression operators

It must be mentioned, however that:

In POSIX awk and gawk, the *, + and ? operators stand for themselves when there is nothing in the regexp that precedes them. For example, /+/ matches a literal plus sign. However, many other versions of awk treat such a usage as a syntax error.

source: GNU Regular expression operators

Thus, setting RS="*, *" implies that it would match the strings "*,", "*, ", "*,  ", ...

$ echo "a,b, c" | awk 'BEGIN{RS="*, *"}1'
a,b, c
$ echo "a*,b, c" | awk 'BEGIN{RS="*, *"}1'
a
b, c

mawk: The mawk manual states the following:

12. Multi-line records
Since mawk interprets RS as a regular expression, multi-line records are easy.

source: man mawk

but

11. Splitting strings, records and files
Awk programs use the same algorithm to split strings into arrays with split(), and records into fields on FS. mawk uses essentially the same algorithm to split files into records on RS.

Split(expr,A,sep) works as follows:

  1. <snip>
  2. If sep = " " (a single space), then <SPACE> is trimmed from the front and back of expr, and sep becomes <SPACE>. mawk defines <SPACE> as the regular expression /[ \t\n]+/. Otherwise sep is treated as a regular expression, except that meta-characters are ignored for a string of length 1, e.g., split(x, A, "*") and split(x, A, /\*/) are the same.
  3. <snip>

source: man mawk

The manual makes no mention of how a regular expression starting with a meta-character (e.g. "*c") should be interpreted.


Note: in the GNU awk section I struck through "POSIX awk" because, according to POSIX, a regular expression of the form "*, " produces undefined results. (This is independent of RS, since RS is not an ERE in POSIX awk anyway.)

The awk utility shall make use of the extended regular expression notation (see XBD Extended Regular Expressions)

source: Awk Posix standard

and

*+?{ The <asterisk>, <plus-sign>, <question-mark>, and <left-brace> shall be special except when used in a bracket expression (see RE Bracket Expression). Any of the following uses produce undefined results:

  • If these characters appear first in an ERE, or immediately following an unescaped <vertical-line>, <circumflex>, <dollar-sign>, or <left-parenthesis>
  • If a <left-brace> is not part of a valid interval expression (see EREs Matching Multiple Characters)

source: POSIX Extended Regular Expressions
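
If a literal asterisk really is wanted at the start of the pattern, one way to sidestep the undefined case is to put it in a bracket expression, where, per the text quoted above, it is not special. A minimal sketch, again using split() rather than RS so that it stays within POSIX; the sample string is the one from the GNU awk example, and the names out and parts are illustrative.

```shell
# "[*], *" means: a literal asterisk, a comma, then zero or more spaces.
out=$(awk 'BEGIN {
    n = split("a*,b, c", parts, /[*], */)
    for (i = 1; i <= n; i++) print parts[i]
}')
printf '%s\n' "$out"
```

Only the "*," after a is treated as a separator; the second comma has no asterisk in front of it, so "b, c" stays in one piece. That is the same split GNU awk produced for RS="*, *", without relying on how an implementation treats a leading *.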

kvantour
  • Thanks kvantour for the explanation. I have one query, which may sound stupid: in my input file there is no `space` anywhere before a `,`. So how does the `*` in `RS=" *, *"` match zero or more spaces? – A.K Dec 04 '18 at 17:04
  • @A.K, how about posting a sample of the expected output in your post too, in code tags? You might get more input that way. – RavinderSingh13 Dec 04 '18 at 17:29
  • @RavinderSingh13: I've updated the post with the expected output. – A.K Dec 04 '18 at 17:33
  • @A.K In regular expressions, `*` is a metacharacter meaning zero or more instances of the preceding character. So the expression `" *,"` matches zero or more spaces before a comma. – kvantour Dec 04 '18 at 19:37
  • @A.K I have updated the answer to be more accurate with respect to the differences between GNU awk and mawk (as I expect you are using mawk) – kvantour Dec 06 '18 at 18:56
  • @kvantour: Thanks for your precious time and the beautiful explanation. It's clearer to me now :) – A.K Dec 08 '18 at 09:55

Could you please try the following:

awk '{gsub(", ","<<<---" ORS)} 1;END{print "<<<---"}'   Input_file
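
A possible variant, offered as a sketch rather than as part of the command above: passing the regex / *, */ to gsub() instead of the fixed string ", " also handles commas that have spaces before them or no space after them. The two sample lines are taken from the question's file; the name out is illustrative.

```shell
# Replace every comma, together with any spaces around it, by the marker plus a newline.
out=$(printf '%s\n' \
    'Lorem ipsum dolor sit amet, consectetur adipiscing elit.' \
    'Nunc enim orci, euismod id nisi eget, interdum cursus ex.' |
  awk '{ gsub(/ *, */, "<<<---" ORS) } 1; END { print "<<<---" }')
printf '%s\n' "$out"
```

Because gsub() with a regex is part of POSIX awk, this does not depend on how a given implementation interprets a multi-character RS.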
RavinderSingh13
  • @A.K, could you please try this solution too https://stackoverflow.com/a/53618562/5866580 and let me know? – RavinderSingh13 Dec 04 '18 at 17:36
  • Thanks @RavinderSingh13, this works perfectly! But I still have the same query stuck in my head: in my input file, how does the `*` in `RS=" *, *"` match zero or more spaces? – A.K Dec 04 '18 at 17:53