awk create a list for testing elements

Question

I have a list of discrete elements, and I want to test whether a field from each line of my file is in that list. I'd like a succinct way to create a list or array in awk and then test each line against it.

My list of discrete elements:

ports=(1010, 2020, 3030, 8888, 12345)

myFile:

127.0.0.1 1010
127.0.0.1 1011
127.0.0.1 12345
127.0.0.1 3333

My pseudocode:

awk '
  BEGIN {
    test_ports=[1010, 2020, 3030, 8888, 12345]
  }
  ($2 in test_ports) {
    print $0
  }
' myFile

The code below works, but it is not succinct and I don't like how it grows as the list grows, like if I get 100 ports to test against, or 1000...

awk '
  BEGIN {
    test_ports["1010"]=1
    test_ports["2020"]=1
    test_ports["3030"]=1
    test_ports["8888"]=1
    test_ports["12345"]=1
  }
  ($2 in test_ports) {
    print $0
  }
' myFile

Something like this would be good too, but the syntax isn't quite right:

for i in 1010 2020 3030 8888 12345 {test_ports[i]=1}

EDIT

This code works too and is very close to what I need, but it still seems a bit long for what it's doing.

awk '
  BEGIN {
    ports="1010,2020,3030,8888,12345"
    split(ports, ports_array, ",")
    for (i in ports_array) {test_ports[ports_array[i]] = 1}
  }
  ($2 in test_ports) {
    print $0
  }
' myFile

anubhava · Accepted Answer · 2020-11-11T18:01:28.067

2

You may use it like this:

awk '
  BEGIN {
    ports = "1010 2020 3030 8888 12345"  # ports string
    split(ports, temp)                   # split on whitespace into array temp
    for (i in temp)                      # populate array test_ports
       test_ports[temp[i]]
  }

  $2 in test_ports                       # print rows with matching ports
' myFile

127.0.0.1 1010
127.0.0.1 12345

A note of explanation:

temp is a numerically indexed array where the ports (1010, 2020, etc.) are the array values, indexed from 1.
test_ports is an associative array where the ports are the array keys and the values are null.
The `elem in array` operator tests whether the given element is an index (aka "subscript") of the array.
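
The difference between values and keys can be checked with a small sketch. The three-port sample string is made up for illustration. The point is only that `in` looks at subscripts, so a port stored as a value in temp is not found until it has been copied into test_ports as a key.

```shell
out=$(awk '
  BEGIN {
    n = split("1010 2020 3030", temp)   # temp[1]="1010", temp[2]="2020", temp[3]="3030"
    for (i = 1; i <= n; i++)
      test_ports[temp[i]]               # referencing a subscript creates it with an empty value

    print "n=" n
    print "1010 in temp: " (("1010" in temp) ? "yes" : "no")
    print "1 in temp: " ((1 in temp) ? "yes" : "no")
    print "1010 in test_ports: " (("1010" in test_ports) ? "yes" : "no")
  }
')
printf '%s\n' "$out"
```

The ports are values in temp, so only the numeric subscripts 1 to 3 pass the `in` test there. In test_ports the ports themselves are the subscripts.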

Addendum: If your list of ports is large, you also have the option of reading the ports from a file, like this:

awk 'NR == FNR {ports[$1]; next} $2 in ports' ports.list myFile

Or, if you have the ports saved in a shell string, use:

ports='1010 2020 3030 8888 12345'
awk 'NR==FNR{ports[$1]; next} $2 in ports' <(printf '%s\n' $ports) myFile

127.0.0.1 1010
127.0.0.1 12345

edited Nov 11 '20 at 18:01

answered Nov 11 '20 at 15:59

anubhava


The advantage of this approach is you can pass the input string using the `-v` option: `awk -v ports="1010 2020 3030 8888 12345" 'BEGIN {split(ports, temp); ...` – glenn jackman Nov 11 '20 at 16:05
@glennjackman: Thanks so much for adding the nice explanatory note – anubhava Nov 11 '20 at 16:09
on the contrary, testing against 100 ports means passing a string of 499 characters, you can see how this scales up. In the end, instead of passing the array hardcoded, you'll have it created dynamically from a string of arbitrary length – Daemon Painter Nov 11 '20 at 16:18
@DaemonPainter: If ports list is so big then it is better to put them in a file and have awk process that file and build array instead of passing a huge string. Awk will have no problems building an array of 500 ports – anubhava Nov 11 '20 at 16:20
I like the separate file approach. I would usually pass the ports.list file in as a variable and have awk build the ports_array in a BEGIN statement, like `-v ports_file=ports.list 'BEGIN {while ((getline < ports_file) > 0) {test_ports[$0]=1}} ...'` – Rusty Lemur Nov 11 '20 at 18:17
As shown in my answer, `awk` has a better way to build an array from a file than a `C`-style `while ((getline < ports_file) > 0)` loop. But you can do that as well if you like. – anubhava Nov 11 '20 at 18:19
I'm curious to know advantages of the method shown in your answer. Using while getline keeps the configuration files separate from the input/data files, and it doesn't require testing FNR==NR for every single record, which probably wouldn't matter, but I wonder if it adds up when awk is processing millions or billions of lines of text (some would then argue awk shouldn't be used). – Rusty Lemur Nov 11 '20 at 18:32
Using a getline loop to read a dictionary in the BEGIN section isn't the worst idea, you just have to be careful how you implement it (see case "c)" under "Applications" at http://awk.freeshell.org/AllAboutGetline) and yes, it would improve performance very, very, slightly but it's usually just not worth writing the extra code. Those who argue that awk shouldn't be used to process billions of lines of text are wrong - awk is typically faster than C for text processing because awk itself is highly optimized for common text processing functionality, unlike equivalent C code people write by hand. – Ed Morton Nov 11 '20 at 18:37
As one who uses awk to process billions of lines of code each day, I agree :) But I get a lot of flak for it, or at least for using bash to wrap functionality around awk. – Rusty Lemur Nov 11 '20 at 18:46
Ah, now adding bash around it does open a can of worms - there you can REALLY mess things up :-). If you haven't read it yet, check out Stephane's answer at least at [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) for a good discussion of non-obvious issues around text processing in shell and there are plenty of other gotchas! – Ed Morton Nov 11 '20 at 18:52

Ed Morton · Answer 2 · 2020-11-11T19:48:27.447

2

Since you said "I'd like a succinct way to create a list or array in awk and then test each line against that list", here is a succinct way to create a list in awk and then test each line against that list:

$ awk 'index(",1010,2020,3030,8888,12345,",","$2",")' file
127.0.0.1 1010
127.0.0.1 12345

or if you prefer:

$ awk -v ports='1010,2020,3030,8888,12345' 'index(","ports",",","$2",")' file
127.0.0.1 1010
127.0.0.1 12345
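
The commas on both sides of the list and of the field are what prevent partial matches. Here is a small sketch of the difference, using a made-up input in which 101 and 345 are substrings of listed ports but not ports themselves. The variable names are for illustration only.

```shell
# Hypothetical input: 101 and 345 are only substrings of listed ports; 1010 is a real entry.
input='127.0.0.1 101
127.0.0.1 345
127.0.0.1 1010'

# Without delimiters, index() is a plain substring search, so all three lines match.
naive=$(printf '%s\n' "$input" |
  awk -v ports='1010,2020,3030,8888,12345' 'index(ports, $2)')

# With a comma on each side of both the list and the field, only whole entries match.
exact=$(printf '%s\n' "$input" |
  awk -v ports='1010,2020,3030,8888,12345' 'index("," ports ",", "," $2 ",")')

printf 'naive:\n%s\nexact:\n%s\n' "$naive" "$exact"
```

The naive form prints all three lines, while the delimited form prints only the 1010 line. This holds as long as the port values never contain a comma.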

edited Nov 11 '20 at 19:48

answered Nov 11 '20 at 18:07

Ed Morton


It took me a while to understand the commas, but I like it! Nice and succinct. – Rusty Lemur Nov 11 '20 at 18:11

Daemon Painter · Answer 3 · 2020-11-11T16:28:23.700

Assuming you have a large number of ports (test strings) to test against, I'd suggest matching using two files instead of a string.

Let ports.txt be the file of ports, one per line, and test.txt your input test file.

Then run:

awk 'NR==FNR{port[$0]=$0; next} ($2 in port){print}' ports.txt test.txt

This creates the port[] array from the first file and uses it to print the matching lines of the second file.

This solution expands on the concept proposed in anubhava's answer, but with a concise syntax as you were looking for.

More info on the NR==FNR syntax here. A final note on reusability: when driven by an external process, you can run the same awk command against the same test.txt file while swapping the ports.txt file (e.g. ports1.txt, ports2.txt, portn.txt ...), so that you match against port groups instead.
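
For readers unfamiliar with the idiom: NR counts records across all input files, while FNR restarts at 1 for each file, so NR==FNR holds only while the first file is being read. Below is a minimal sketch that prints both counters and then applies the filter. The temporary directory and the file contents are assumptions made for the demonstration, not part of the answer.

```shell
tmp=$(mktemp -d)                                   # scratch directory for the demo files
printf '%s\n' 1010 12345 > "$tmp/ports.txt"
printf '%s\n' '127.0.0.1 1010' '127.0.0.1 1011' '127.0.0.1 12345' > "$tmp/test.txt"

# Show how the two counters behave across both files.
counters=$(cd "$tmp" && awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' ports.txt test.txt)

# The NR==FNR idiom: collect keys from the first file, then filter the second.
matches=$(cd "$tmp" && awk 'NR==FNR { port[$0]; next } $2 in port' ports.txt test.txt)

printf '%s\n---\n%s\n' "$counters" "$matches"
rm -rf "$tmp"
```

One caveat: if the first file is empty, NR==FNR is also true while awk reads the second file, so every line of test.txt would be treated as a port and nothing would be printed.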

score 1 · Answer 4 · answered Nov 11 '20 at 18:53

1

Assuming you have the ports in ports.txt, then you might be able to use join:

$ cat ports.txt
1010
2020
3030
8888
12345
$ join -12 -o1.1,2.1 <(sort -bk2 myFile.txt) <(sort -b ports.txt)
127.0.0.1 1010
127.0.0.1 12345

answered Nov 11 '20 at 18:53

Andreas Louv


I like outside-the-box solutions! – Rusty Lemur Nov 11 '20 at 18:58
