
I have a question about the range pattern in gawk, `BEGPAT, ENDPAT {ACTION}`. It seems unsuited to my case or, more likely, I misunderstand how ranges work.

I want to print the lines that fall within a range of dates of the form YYYY-MM-DD. The dates are in a specific field (column), they are in ascending order, and they are not unique, i.e.:

2021-08-01
2021-08-02
2021-08-02
2021-08-02
2021-08-03
2021-08-04
2021-08-05
2021-08-05
2021-08-05

How can I select, let's say, everything from 2021-08-02 to 2021-08-05 (the actual data goes back two years) to get:

2021-08-02
2021-08-02
2021-08-02
2021-08-03
2021-08-04
2021-08-05
2021-08-05
2021-08-05

I tried the following: '/2021-08-03/, /2021-08-05/{print}'

Resulting in this:

2021-08-03
2021-08-04
2021-08-05

Any help within the scope of gawk/awk is appreciated. The documentation about ranges is here, but since I'm just learning to code it can be difficult to understand sometimes. Perhaps there are other approaches within awk to solve this?

MarArauyo
  • Range expressions are most useful with `sed`. `awk` allows the use of variables, so a simple state variable used as a flag is often the best approach. – David C. Rankin Aug 21 '21 at 02:03
  • For what reason do you use `/2021-08-03/` as your starting date instead of the one you really want (`/2021-08-02/`)? – Renaud Pacalet Aug 22 '21 at 05:05

2 Answers

awk -v beg='2021-08-02' -v end='2021-08-05' '
    $1 >= beg { inRange=1 }
    $1 > end { exit }
    inRange { print }
' file

Unless you're coding strictly for brevity, range expressions are never the best approach and you should always use a flag variable instead (I named it inRange above, but f, found, or any other name you like is fine too). See "Is a /start/,/end/ range expression ever useful in awk?".

If you prefer a briefer solution you can do the above with hard-coded values and a shorter variable name as:

awk '$1>="2021-08-02"{f=1} $1>"2021-08-05"{exit} f' file

Note that, among other things, the above is more efficient than a range expression because it exits as soon as the range has been printed instead of reading the rest of the input.
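As a minimal sketch of the same idea when the date is not in the first column: the file name sample.txt, its three-column layout, and the col variable are all assumptions made for illustration. Because YYYY-MM-DD dates sort the same way as plain strings, a string comparison on the field is enough, and with sorted input the early exit still applies.

```shell
# Hypothetical sample data: an id, a date and a value on each line.
cat > sample.txt <<'EOF'
a 2021-08-01 10
b 2021-08-02 11
c 2021-08-02 12
d 2021-08-03 13
e 2021-08-05 14
f 2021-08-05 15
g 2021-08-06 16
EOF

# col is the (assumed) number of the field that holds the date.
awk -v col=2 -v beg='2021-08-02' -v end='2021-08-05' '
    $col > end  { exit }   # input is sorted, so no later line can qualify
    $col >= beg            # pattern without action: print the line
' sample.txt
```

With this data the lines with ids b through f are printed. Since every line is tested on its own, the start date does not have to be present in the input for the selection to begin.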

Ed Morton

I would say that a range is unsuited to your case because the dates repeat:

2021-08-01
2021-08-02
2021-08-02
2021-08-02
2021-08-03
2021-08-04
2021-08-05
2021-08-05
2021-08-05

so ENDPAT will trigger at the first occurrence of 2021-08-05 and end the range there. If you must use a range at any price, you might use GNU AWK as follows. Let file.txt contain

2021-08-01
2021-08-02
2021-08-02
2021-08-02
2021-08-03
2021-08-04
2021-08-05
2021-08-05
2021-08-05

then

awk '/2021-08-0[25]/,/2021-08-05/{print}' file.txt

output

2021-08-02
2021-08-02
2021-08-02
2021-08-03
2021-08-04
2021-08-05
2021-08-05
2021-08-05

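Explanation: there is more than one range here. The first runs from the first 2021-08-02 to the first 2021-08-05. Every later 2021-08-05 line matches both the begin pattern and the end pattern, so it opens and closes a one-line range of its own. EDIT: If composing the regular expression this way is not possible, you might use | instead, i.e. awk '/2021-08-02|2021-08-05/,/2021-08-05/' file.txt, as suggested in a comment.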

(tested in GNU Awk 5.0.1)
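To make the "several ranges" behaviour visible, here is a rough sketch that emulates the range with a flag and numbers each range as it opens. The file name dates.txt and the variable names inRange and n are made up for this illustration, and the emulation assumes the usual awk rule that the end pattern is also tested on the line that opens the range.

```shell
# Hypothetical shortened input, one date per line.
printf '%s\n' 2021-08-01 2021-08-02 2021-08-02 2021-08-03 2021-08-05 2021-08-05 > dates.txt

awk '
    /2021-08-0[25]/ && !inRange { inRange = 1; n++ }   # a new range opens on this line
    inRange                     { print n ": " $0 }    # tag the line with its range number
    /2021-08-05/                { inRange = 0 }        # the range closes on this line
' dates.txt
# 1: 2021-08-02
# 1: 2021-08-02
# 1: 2021-08-03
# 1: 2021-08-05
# 2: 2021-08-05
```

The last line belongs to a second range that starts and ends on the same record, which is why the repeated end date is not lost.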

Daweo
  • In this case `awk '/2021-08-0[2-5]/' file.txt` would work the same. But there are limitations: what if the starting and ending dates differ by more than one character? I suggest improving your answer with a more generic solution: `awk '/2021-08-02|2021-08-05/,/2021-08-05/' file.txt`. – Renaud Pacalet Aug 22 '21 at 05:13
  • That would match anywhere in each line of input but the OP said the dates they are interested in only exist in 1 column of the input so it should really be `awk '$1~/2021-08-0[25]/,$1~/2021-08-05/{print}' file.txt`, etc. (if the date is in $1). – Ed Morton Aug 22 '21 at 11:37
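
The difference between matching the whole line and matching one field can be sketched as follows. The file notes.txt and its contents are invented for illustration, with the date of interest assumed to be in $1 and a second date appearing in the free text of the first line.

```shell
# Hypothetical input: field 1 is the date; the first line mentions
# another date in its free-text part.
cat > notes.txt <<'EOF'
2021-08-01 due 2021-08-05
2021-08-02 a
2021-08-03 b
2021-08-05 c
2021-08-05 d
2021-08-06 e
EOF

# Whole-line match: the first line contains 2021-08-05, so it is printed too.
awk '/2021-08-0[25]/,/2021-08-05/' notes.txt

# Match restricted to field 1: the first line is skipped.
awk '$1 ~ /2021-08-0[25]/, $1 ~ /2021-08-05/' notes.txt
```

The first command prints five lines starting with the 2021-08-01 line, while the second prints only the four lines whose first field lies in the wanted range.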