awk: read pattern from file, awk '$2 !~ /{newline delimited file}/ && $1 > 5000'

Question

I have a command that pipes output like:

   1365 8.67.53.0.9
   2657 1.2.3.4
   5956 127.9.9.0
  10463 127.0.0.1
  15670 6.5.4.3.2
  17984 -

to:

awk '$2 !~ /-|127.0.0.1|6.5.4.3.2/ && $1 > 5000'

which should print:

   5956 127.9.9.0

That is, it should print every line where $2 doesn't contain -, 127.0.0.1, or 6.5.4.3.2 and where $1 is greater than 5000.

I would like to keep all of the values that should be ignored in a newline delimited file like:

-
127.0.0.1
6.5.4.3.2

rather than within the regex /-|127.0.0.1|6.5.4.3.2/ because my list of these will be growing.

Ideally, this could be within a single command and not a function or awk program file. Also, if possible I would like the matching to be more exact (less greedy?). I think the current regex will also match something like 127.0.0.11 or 6.5.4.3.22.

Note that `.` in a regex doesn't only match itself; it matches _anything_. So `/1.2.3.4/` matches `1A2B3C4`. — Charles Duffy, Aug 08 '23 at 17:52
Speaking to your question, though -- awk supports a map datatype, so you can easily add keys to your map to track elements in the ignore file as you read it (take the ignore file as the first input and the data file as the second input, and then you can decide which kind of processing to do depending on which file you're on). Look for any of the (many, many) examples already on this site of folks using `awk` to act like `uniq` and you'll see all the moving parts you need already in use. — Charles Duffy, Aug 08 '23 at 17:53
Your regexp isn't anchored so `/1.2.3.4/` also matches the middle of `21.2.3.45`. You should be using full-line string matching, not partial-word regexp matching. See [how-do-i-find-the-text-that-matches-a-pattern](https://stackoverflow.com/questions/65621325/how-do-i-find-the-text-that-matches-a-pattern) for more details on the issue. — Ed Morton, Aug 08 '23 at 18:28
You have an extra `/` in your regex pattern; shouldn't the middle one be a `|`? — Mark Reed, Aug 08 '23 at 18:40
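
The over-matching described in these comments can be checked directly. This is a minimal sketch using three made-up test strings; the variable names `loose` and `exact` are illustrative only and do not come from the question.

```shell
# Three candidate values: the real address, a string with letters where the
# dots should be, and a longer address that merely contains the real one.
candidates=$(printf '%s\n' '1.2.3.4' '1A2B3C4' '21.2.3.45')

# Unanchored regex with unescaped dots: each "." matches any character and
# the pattern may match anywhere inside the field.
loose=$(printf '%s\n' "$candidates" | awk '$1 ~ /1.2.3.4/ {n++} END {print n+0}')

# Exact string comparison: only the literal value is accepted.
exact=$(printf '%s\n' "$candidates" | awk '$1 == "1.2.3.4" {n++} END {print n+0}')

echo "loose=$loose exact=$exact"   # prints: loose=3 exact=1
```

The regex accepts all three strings, while the string comparison accepts only the literal address, which is why exact matching is the safer choice for an ignore list.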

anubhava · Accepted Answer · 2023-08-08T18:53:31.523

You can keep the values to be skipped in a file called skip, like this:

cat skip

-
127.0.0.1
6.5.4.3.2

Then run awk using both files as:

awk 'NR == FNR {omit[$1]; next} $1 > 5000 && !($2 in omit)' skip file

  5956 127.9.9.0

Here:

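While processing the first file (skip), NR == FNR is true, so we store each value as a key in the array omit and move on to the next line.
Then, while processing the main file, we print a line only if $1 > 5000 and $2 is not a key in omit. Because this is an exact key lookup rather than a regex match, 127.0.0.11 is not treated as 127.0.0.1.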
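
The same one-liner can be spread over several lines with comments. This is a minimal sketch that recreates the question's sample data; the temporary directory and the variable name `result` are assumptions made for illustration.

```shell
# Illustrative setup: recreate the sample files in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '-' '127.0.0.1' '6.5.4.3.2' > "$tmp/skip"
printf '%s\n' '   1365 8.67.53.0.9' '   2657 1.2.3.4' '   5956 127.9.9.0' \
              '  10463 127.0.0.1' '  15670 6.5.4.3.2' '  17984 -' > "$tmp/file"

result=$(awk '
  # NR == FNR holds only while the first file (skip) is being read:
  # register the value as a key of omit, then skip to the next line.
  NR == FNR { omit[$1]; next }
  # Main file: print lines whose count exceeds 5000 and whose second
  # field is not an exact key of omit.
  $1 > 5000 && !($2 in omit)
' "$tmp/skip" "$tmp/file")

printf '%s\n' "$result"
rm -rf "$tmp"
```

One caveat with the NR == FNR idiom: if skip is empty, the condition stays true while the main file is read, so nothing is printed.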

edited Aug 08 '23 at 18:53

answered Aug 08 '23 at 18:23

Thanks! This seems to work. Is it possible to read the initial $1 and $2 from the output of a previous command piped to awk, as in `cat file | awk {magic stuff here}`? – Special Monkey Aug 08 '23 at 18:36
@SpecialMonkey use `-` to designate input from stdin, eg: `cat file | awk 'magic stuff here' skip -` ; note the last 2 args ... `skip -` => first file processed is `skip` while second 'file' processed is actually stdin (`-`) – markp-fuso Aug 08 '23 at 18:38
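
The stdin variant from this comment might look as follows. This is a sketch only: the function `produce` is a hypothetical stand-in for the real upstream command, and the temporary directory and the variable `piped` are likewise illustrative.

```shell
tmp=$(mktemp -d)
printf '%s\n' '-' '127.0.0.1' '6.5.4.3.2' > "$tmp/skip"

# Hypothetical stand-in for the command whose output is piped to awk.
produce() {
  printf '%s\n' '   5956 127.9.9.0' '  10463 127.0.0.1' '  17984 -'
}

# skip is read first as a regular file; "-" tells awk to read stdin second.
piped=$(produce | awk 'NR == FNR {omit[$1]; next} $1 > 5000 && !($2 in omit)' "$tmp/skip" -)

printf '%s\n' "$piped"
rm -rf "$tmp"
```

The order of the two operands matters: skip must come before `-` so that the ignore list is loaded before any data line is tested.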

Paolo · Answer 2 · 2023-08-08T18:37:53.257

Given a file named input:

127.0.0.1
6.5.4.3.2

and a file named file:

   1365 8.67.53.0.9
   2657 1.2.3.4
   5956 127.9.9.0
  10463 127.0.0.1
  15670 6.5.4.3.2
  17984 -

# read input file and perform parameter substitution
$ ips=$(< input); ips=${ips//$'\n'/|}; ips=${ips//./[.]};
# create variable for regex
$ regex="^(-|${ips})$"
# pass regex to awk as variable and run logic
$ awk -v regex="$regex" '$2 !~ regex && $1 >5000' file
   5956 127.9.9.0
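
The parameter substitutions above are bash-specific. A possible portable variant, not part of the original answer, builds the same anchored regex with `paste` and `sed`; the temporary directory and the variable `filtered` are assumptions for illustration.

```shell
tmp=$(mktemp -d)
printf '%s\n' '127.0.0.1' '6.5.4.3.2' > "$tmp/input"
printf '%s\n' '   1365 8.67.53.0.9' '   2657 1.2.3.4' '   5956 127.9.9.0' \
              '  10463 127.0.0.1' '  15670 6.5.4.3.2' '  17984 -' > "$tmp/file"

# Join the lines with "|" and turn every "." into the bracket expression "[.]".
ips=$(paste -s -d '|' "$tmp/input" | sed 's/[.]/[.]/g')

# Anchor the alternation so that only a whole field can match.
regex="^(-|${ips})\$"
echo "$regex"

filtered=$(awk -v regex="$regex" '$2 !~ regex && $1 > 5000' "$tmp/file")
printf '%s\n' "$filtered"
rm -rf "$tmp"
```

Using `[.]` rather than `\.` sidesteps the backslash-escape processing that awk applies to values assigned with `-v`. For a long ignore list, the array lookup from the accepted answer avoids building a very large regex.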
