Getting a weird bug (not an error) with AWK

Question

I was working with a text file named countries whose contents are the following:

USSR    8649   275    Asia
Canada  3852   25     North America
China   3705   1032   Asia
USA     3615   237    North America
Brazil  3286   134    South America
India   1267   746    Asia
Mexico  762    78     North America
France  211    55     Europe
Japan   144    120    Asia
Germany 96     61     Europe
England 94     56     Europe

I am trying to get the following range pattern to work:

awk '/Europe/, /Asia/' countries

which is supposed to print every line starting from the first instance of the word "Europe", and ending at the first instance of the word "Asia".

So the output that I expected was this:

France  211    55     Europe
Japan   144    120    Asia

But the output that I am getting instead is this:

France  211    55     Europe
Japan   144    120    Asia
Germany 96     61     Europe
England 94     56     Europe

as if the second pattern was not matched. What is happening?

the output you're getting is valid; `awk` range matching is turned on/off *on-the-fly* as the start/end patterns are found; in your case: the range match is turned on for `France/Europe`, turned off for `Japan/Asia`, then turned on *again* for `Germany/Europe`; and since no more `Asia` entries are found the range match remains 'on' for the rest of the file — markp-fuso, Dec 24 '21 at 20:35
To test `the first instance of the word "Europe"` you should have had at least 2 `Europe`s before the first Asia. It's important when providing sample data to create it such that it actually tests your requirements. With the provided example you can get answers that produce the expected output from your sample input but don't actually do what you want. — Ed Morton, Dec 25 '21 at 14:16
Right now you're getting all sorts of answers that will produce all sorts of output if there's 2 Europes before the first Asia, an Asia but no Europe in the input, a Europe but no Asia, Europe or Asia appearing in the wrong column, Europe only on the line immediately before Asia, etc., all of which will produce the expected output in your question from the sample input in your question. Please [edit] your question to state all your rainy day requirements too, not just the sunny day case where there's 1 Europe followed by an Asia. — Ed Morton, Dec 25 '21 at 14:25

Ed Morton · Answer 1 · 2021-12-25T14:21:36.043

Never use range expressions (/start/,/end/): their behavior is non-obvious for the rainy day cases (as you just discovered), and while they make trivial tasks very slightly briefer, they then need a complete rewrite or duplicate conditions for anything the slightest bit more interesting. See "Is a /start/,/end/ range expression ever useful in awk?" for more details.

Just use a flag (inBlock below) instead:

$ cat tst.awk
{
    if ( inBlock ) {
        buf = buf ORS $0
        if ( $4 == "Asia" ) {
            print buf
            inBlock = 0
        }
    }
    else if ( $4 == "Europe" ) {
        buf = $0
        inBlock = 1
    }
}

$ awk -f tst.awk file
France  211    55     Europe
Mexico  762    78     North America
France  211    55     Europe
Japan   144    120    Asia

The above was run against this input file

$ cat file
USSR    8649   275    Asia
Canada  3852   25     North America
China   3705   1032   Asia
USA     3615   237    North America
Brazil  3286   134    South America
India   1267   746    Asia
France  211    55     Europe
Mexico  762    78     North America
France  211    55     Europe
Japan   144    120    Asia
Germany 96     61     Europe
England 94     56     Europe

which is modified from the one in the question so it can be used to test the requirements:

print every line starting from the first instance of the word "Europe", and ending at the first instance of the word "Asia"
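
To make the difference on one rainy day case concrete, here is a small sketch that feeds both approaches an input containing Europe but no later Asia. The file name no_asia.txt and the shell variable names are made up for this illustration; tst.awk is recreated with the same script as above so the snippet can be run on its own.

```shell
# Hypothetical input: "Europe" appears, but no "Asia" follows it.
cat > no_asia.txt <<'EOF'
Brazil  3286   134    South America
France  211    55     Europe
Germany 96     61     Europe
EOF

# Same script as tst.awk above, recreated so this runs standalone.
cat > tst.awk <<'EOF'
{
    if ( inBlock ) {
        buf = buf ORS $0
        if ( $4 == "Asia" ) {
            print buf
            inBlock = 0
        }
    }
    else if ( $4 == "Europe" ) {
        buf = $0
        inBlock = 1
    }
}
EOF

# The range expression from the question versus the flag-based script.
range_out=$(awk '/Europe/, /Asia/' no_asia.txt)
flag_out=$(awk -f tst.awk no_asia.txt)

printf 'range expression printed:\n%s\n' "$range_out"
printf 'flag script printed:\n%s\n' "$flag_out"
```

With this input the range expression should print the France and Germany lines, because the range never closes, while the flag script should print nothing, because the buffer is only printed once an Asia line is seen.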

markp-fuso · Answer 2 · 2021-12-25T15:39:03.293

One awk idea uses a variable (buffer) to keep track of the lines we've seen 'in the range', plus a couple of input variables to make this a bit more dynamic:

awk -v start="${range_start}" -v end="${range_end}" '
($0~start),($0~end) { buffer=buffer pfx $0
                      pfx=ORS
                      if ($0 ~ end) {
                         print buffer
                         buffer=pfx=""
                         next
                      }
                    }
' countries

NOTE:

In the worst case we need to store the entire file in buffer (this assumes the host has enough memory to hold the entire file).
A dual-pass solution could (effectively) remove that worst-case memory concern.

Take it for a test drive:

$ range_start='Europe'
$ range_end='Asia'

France  211    55     Europe
Japan   144    120    Asia

$ range_start='North America'
$ range_end='Asia'

Canada  3852   25     North America
China   3705   1032   Asia
USA     3615   237    North America
Brazil  3286   134    South America
India   1267   746    Asia
Mexico  762    78     North America
France  211    55     Europe
Japan   144    120    Asia

$ range_start='Asia'
$ range_end='Antarctica'

    -- no output
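
The dual-pass idea mentioned in the NOTE could look roughly like the sketch below. It is only one possible reading of that idea, not a tested replacement: the helper name print_blocks, the scratch file demo.txt and the awk variables inside and last are assumptions for this example. The first pass records the line number of the last end line that closes an open range; the second pass runs an ordinary range expression but stops after that line, so an unclosed trailing range is never printed and nothing is buffered.

```shell
# Hypothetical scratch copy of the sample data from the question.
cat > demo.txt <<'EOF'
USSR    8649   275    Asia
Canada  3852   25     North America
China   3705   1032   Asia
USA     3615   237    North America
Brazil  3286   134    South America
India   1267   746    Asia
Mexico  762    78     North America
France  211    55     Europe
Japan   144    120    Asia
Germany 96     61     Europe
England 94     56     Europe
EOF

# Hypothetical helper: print_blocks START END FILE
# FILE is read twice, so it must be a regular, non-empty file (not a pipe).
print_blocks() {
    awk -v start="$1" -v end="$2" '
    NR == FNR {                       # pass 1: locate the last closed range
        if (!inside && $0 ~ start) inside = 1
        if (inside && $0 ~ end) { inside = 0; last = FNR }
        next
    }
    FNR > last { exit }               # pass 2: stop after the last closed range
    ($0 ~ start), ($0 ~ end)          # print each range up to that point
    ' "$3" "$3"
}

print_blocks 'Europe' 'Asia' demo.txt
```

Called with 'North America' and 'Asia', or with 'Asia' and 'Antarctica', this should give the same lines as the corresponding test drives above, while keeping only a few scalars in memory.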

score 0 · Answer 3 · answered Dec 24 '21 at 23:03

A more flexible approach is:

$ awk '/Europe/{f=1} f; f&&/Asia/{exit}' file

France  211    55     Europe
Japan   144    120    Asia

With the first start pattern match, set a flag; with the first end pattern match after that, exit. By rearranging the blocks, you can also choose not to print the start or end matching lines.
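
As a rough illustration of rearranging the blocks, the sketch below runs three orderings of the same rules against a small scratch file. The file name demo3.txt and the shell variable names are assumptions for this example only.

```shell
# Hypothetical scratch file: one Europe line, one line in between, then Asia.
cat > demo3.txt <<'EOF'
France  211    55     Europe
Mexico  762    78     North America
Japan   144    120    Asia
Germany 96     61     Europe
EOF

# Print both boundary lines: set the flag, print, then test for the end.
both=$(awk '/Europe/{f=1} f; f&&/Asia/{exit}' demo3.txt)

# Skip the start line: print before the flag can be set on that line.
no_start=$(awk 'f; f&&/Asia/{exit} /Europe/{f=1}' demo3.txt)

# Skip the end line: exit before the print rule is reached.
no_end=$(awk '/Europe/{f=1} f&&/Asia/{exit} f' demo3.txt)

printf '%s\n---\n%s\n---\n%s\n' "$both" "$no_start" "$no_end"
```

The only thing that changes between the three commands is the order of the rules; awk evaluates them top to bottom for every record, which is why moving the print rule (the bare f) decides whether a boundary line is included.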

Carlos Pascual · Answer 4 · 2021-12-25T09:37:17.323

You can try this awk:

awk '$4 == "Asia" {if (key == "Europe") printf "%s\n%s\n", prev, $0}{key=$4;prev=$0}' file
France  211    55     Europe
Japan   144    120    Asia

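The initial condition $4 == "Asia" on its own matches several records, so the variables key and prev (the previous record's fourth field and the previous record itself) narrow it down to an Asia record that immediately follows a Europe record.
We print that pair of records using printf.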

edited Dec 25 '21 at 09:37

answered Dec 25 '21 at 09:07


Luuk · Answer 5 · 2021-12-25T09:33:04.390

From man gawk:

begpat, endpat

A pair of patterns separated by a comma, specifying a range of records. 
The range includes both the initial record that matches begpat and the 
final record that matches endpat.

There is no real mention of how many times a "range of records" can occur, but every time begpat is matched while no range is active, a range starts (again).

Example:

If you have a file (e.g. abc.txt):

a  1
b  2
c  3
d  4
e  1
f  2
g  3
h  4

and do: gawk '/2/,/3/' abc.txt, the result is:

b  2
c  3
f  2
g  3

and for gawk '/2/,/1/' abc.txt, the result is:

b  2     <== start because of /2/
c  3
d  4
e  1     <== stop because of /1/
f  2     <== start because of /2/
g  3
h  4
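
One way to read the behaviour above is that a range pattern acts roughly like a flag that the begin pattern switches on and the end pattern switches off, checked on every record. The following is a minimal sketch of that reading, not a statement about how gawk implements ranges; abc.txt is recreated so the snippet runs on its own, and the flag name f and the shell variable names are arbitrary.

```shell
# Recreate the example file from above.
cat > abc.txt <<'EOF'
a  1
b  2
c  3
d  4
e  1
f  2
g  3
h  4
EOF

# The range expression, and a flag-based rewrite of the same idea.
range_out=$(awk '/2/,/1/' abc.txt)
flag_out=$(awk '/2/{f=1} f; f&&/1/{f=0}' abc.txt)

# For this input both commands select the same lines.
[ "$range_out" = "$flag_out" ] && echo "same output"
printf '%s\n' "$flag_out"
```

Seen this way, the second start at "f  2" is unsurprising: nothing in either form remembers that a range has already been completed once.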

The fourth bird · Answer 6 · 2021-12-25T18:09:01.027

You could start concatenating the whole line when matching Europe, stop concatenating when matching Asia, and only then print the collected lines:

awk '/Europe/ || seen {
  seen = 1
  lines = lines $0 ORS
}
seen && /Asia/ {
  printf "%s", lines
  seen = lines = ""
}' countries

Output

France  211    55     Europe
Japan   144    120    Asia

edited Dec 25 '21 at 18:09

answered Dec 25 '21 at 13:12

