seeking reference to understand one pattern "!_[$0]++"

Question

I am an AWK newbie, using GNU utilities ported to Windows (UNXUtils), with gawk instead of awk. A solution on this forum worked like absolute magic, and I'm trying to find a source I can read to better understand the pattern expression offered in that solution.

In "Select unique or distinct values from a list in UNIX shell script", an answer by Dimitre Radoulov offers the following code

zsh-4.3.9[t]%   awk '!_[$0]++' file

as a solution for selecting elements of a list with repeated and jumbled elements, listing each element only once.

I had previously used sort | uniq to do this, which worked fine for small test files. For my actual problem (extracting the list of company symbols from archival order book research data from India's National Stock Exchange for 16 days in April 2006, with 129+ million records in multiple files), the sorting burden became too much. And uniq only eliminates adjacent duplicates.
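The point about adjacency can be seen on a tiny made-up input. This is only a sketch, assuming a POSIX shell; the sample lines and variable names are invented for illustration and are not the real symbol data.

```shell
# Hypothetical sample: "b" repeats, but not only on adjacent lines.
sample=$(printf 'b\na\nb\nb\n')

# uniq alone collapses only the adjacent pair at the end.
uniq_only=$(printf '%s\n' "$sample" | uniq)

# sort first, then uniq: every duplicate goes, but the input order is lost.
sorted_uniq=$(printf '%s\n' "$sample" | sort | uniq)

printf 'uniq alone:\n%s\n' "$uniq_only"
printf 'sort | uniq:\n%s\n' "$sorted_uniq"
```

With uniq alone the non-adjacent "b" survives twice, which is why the sort step was needed in the first place.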

Copying the above line for my Win-GNU gawk, I used

C:\Users\PAPERS\>  cat ..\Full*_Symbols.txt | gawk "!_[$0]++"  | wc -l

946

suggesting that the 129+ million records pertained to 946 different firms, which is a VERY reasonable answer. And it took under 5 minutes on my modest Windows machine, after hours of trying to SORT wore me out.

I looked at all the awk texts I have and searched a bit online. Part of the pattern is clear to me (! serves as NOT, $0 is the whole current record), but I cannot find any explanation for the underscore _, and I have seen ++ in examples only as "update the counter by 1."

Will be grateful for any appropriate text or web reference to understand this example fully, as I think it will help me in other related cases as well. Thanks. Best,

You don't need `sort | uniq` since `sort -u` works just fine. – Ed Morton Jan 18 '14 at 07:51
@EdMorton is that faster than `sort | uniq`? I imagine it would be, but did you ever benchmark it? – Floris Jan 18 '14 at 08:25
Never benchmarked it, as I've never used `sort | uniq` and never had a script using `sort -u` that wasn't fast enough for my purposes. – Ed Morton Jan 18 '14 at 15:45

score 9 · Accepted Answer · edited Jan 18 '14 at 06:20

It is really very clever!

It creates an associative array (meaning the "index" can be anything, not just a number), indexed by the text of the whole line. The first time a line is seen, its element doesn't exist yet and evaluates as zero. The post-increment ++ hands that old zero to the ! and then sets the element to 1, so the expression is true and awk performs the default action (which is to print the input line). From then on _[$0] is non-zero, so if the same line is encountered again the expression is false and nothing is printed.
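To watch those steps happen, here is a minimal sketch, assuming a POSIX shell and any POSIX awk. The names seen, old and trace and the sample input are made up for illustration. It prints each line, the value the post-increment yields, and the negation of that value.

```shell
# For each line, capture the value that seen[$0]++ yields (the count BEFORE
# the increment) and show its negation, which is what the pattern tests.
trace=$(printf 'a\nb\na\na\n' |
  awk '{ old = seen[$0]++; print $0, old, (!old) }')
echo "$trace"
```

The last column is 1 only on the first occurrence of each line, and that is exactly when the one-liner prints.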

I think the underscore is just a "vanilla" variable name: you need a name for your array, and underscore is as valid as monkey but more "anonymous". A classic!
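A quick way to convince yourself that the name carries no special meaning is to run the same pattern with two different array names. This is a sketch on made-up input, assuming a POSIX shell; the shell variable names are hypothetical.

```shell
# Same input, same pattern, two different array names.
input=$(printf 'x\ny\nx\n')

with_underscore=$(printf '%s\n' "$input" | awk '!_[$0]++')
with_monkey=$(printf '%s\n' "$input" | awk '!monkey[$0]++')

# The array name does not affect the result.
[ "$with_underscore" = "$with_monkey" ] && echo "same output"
```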

answered Jan 18 '14 at 06:05 by Floris · edited Jan 18 '14 at 06:20 by Barmar

The other thing to note is that associative arrays pretty much _are_ magic. Wonderful magic. Lots of languages have them these days, and they tend to be startlingly cheap for what they do. – Donal Fellows Jan 18 '14 at 07:56
@DonalFellows I agree they are a gem, and the access speed is much greater than you might expect; almost independent of array size. – Floris Jan 18 '14 at 08:23
@DonalFellows and other commenters: Thanks from the bottom of my heart. This has been very helpful. – Murgie Jan 18 '14 at 09:11

score 5 · Answer 2 · answered Jan 18 '14 at 07:46

There is no explanation for the _ except that some people think it's clever to obfuscate their code by using an underscore character as the name of a variable, in this case an array. Like in C, variable names in awk can start with any letter or underscore but obviously the intent isn't to have them ONLY be an underscore - that's just ridiculous!

The more common and reasonable way to write that code is to name the array seen or similar so you have some clue what it's for:

awk '!seen[$0]++'

The above introduces an array named seen, indexed by the text of the current line. The first time a given string is tested, its element has the value zero; the next time it has the value 1, and so on, because of the post-increment. The negation of that value is therefore true only for the first occurrence of a given string in the input, so subsequent occurrences are discarded.
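Since the element keeps counting after the first hit, the same array can report how often each line occurred. The following is a sketch on invented input, assuming a POSIX shell; the traversal order of for (line in seen) is unspecified in awk, so the result is piped through sort.

```shell
# Count every line, then dump the final counters at end of input.
counts=$(printf 'two\nthree\ntwo\ntwo\n' |
  awk '{ seen[$0]++ } END { for (line in seen) print line, seen[line] }' |
  sort)
echo "$counts"
```

Here "two" ends with a count of 3 and "three" with a count of 1; the deduplicating one-liner only ever cares whether the count was still zero.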

+1 :) I like _ as an identifier though, for me it's somehow more generic, temporary, anonymous ... default-like, just like Perl's _. For me it's OK for one-liners and for short throwaway scripts. — Dimitre Radoulov, Jan 18 '14 at 16:57
There is no tenet in software engineering that suggests anonymous identifiers are desirable. Programs are much easier to read when the functions, variables, etc. have meaningful names than if they have names like `_`. Even a single letter variable name is MUCH better than `_` since at least that doesn't look like some language construct like `[` or `(`. — Ed Morton, Jan 18 '14 at 17:29

BMW · Answer 3 · 2014-01-18T06:57:47.943

2

Put another way, this command can be expanded as:

awk '{if (array[$0]==0) {array[$0]+=1;print}}'

You can read it as:

_ is an associative array, here named "array"

!_[$0] corresponds to (array[$0]==0)

_[$0]++ corresponds to array[$0]+=1
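To check that the expanded form behaves like the one-liner, both can be run on the same input. This is a sketch with made-up sample data, assuming a POSIX shell; the shell variable names are hypothetical.

```shell
input=$(printf 'one\ntwo\none\n')

# The terse pattern-only form.
short=$(printf '%s\n' "$input" | awk '!_[$0]++')

# The expanded form, spread out with comments.
long=$(printf '%s\n' "$input" | awk '
{
  if (array[$0] == 0) {   # first time this line is seen
    array[$0] += 1        # remember it
    print                 # print the whole line
  }
}')

[ "$short" = "$long" ] && echo "identical"
```

One small difference: the expanded form stops counting after the first occurrence, while the one-liner keeps incrementing. The printed output is the same either way.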

answered Jan 18 '14 at 06:48 by BMW · edited Jan 18 '14 at 06:57

+1 for very clear alternative to my explanation - but to be picky the `print` would happen after the increment operation, not before... – Floris Jan 18 '14 at 06:55

Jotne · Answer 4 · 2014-01-18T08:47:49.287

It took me an hour before I first understood this use of an array. So, to help myself, some time back I examined what was going on.

I divided it up and examined it with some tests, changing _[$0] to A[$0].
!A[$0]++ works like this:
Test whether A[$0] is not (!) true, and print the line if it is not true: the pattern has no action attached, and awk's default action is to print the line.
After the test, 1 is added to the array element, since A[$0]++ is the same as A[$0]=A[$0]+1. With the ++ after the element, the increment is done after the test.

So !A[$0]++ can be changed to:

{if (!A[$0]++) print $0}

and, with some extra informational text:

{if (!A[$0]++) print "output="$0; else print "output="}

With this data as input

cat file
one
two
three
four
two
five
three
six

I get this output:

awk '{printf "line=%s array=%s ",$0,A[$0]} {if (!A[$0]++) print "output="$0; else print "output="}' file
line=one array= output=one
line=two array= output=two
line=three array= output=three
line=four array= output=four
line=two array=1 output=
line=five array= output=five
line=three array=1 output=
line=six array= output=six

The same output, annotated:

awk '{printf "line=%s array=%s ",$0,A[$0]} {if (!A[$0]++) print "output="$0; else print "output="}' file
line=one array= output=one          # line is `one`; not seen before, so array is blank (same as 0, not true): print the line
line=two array= output=two          # line is `two`; not seen before, so array is blank (same as 0, not true): print the line
line=three array= output=three      # line is `three`; not seen before, so array is blank (same as 0, not true): print the line
line=four array= output=four        # line is `four`; not seen before, so array is blank (same as 0, not true): print the line
line=two array=1 output=            # line is `two`; seen before, so array is 1 (true): do not print the line
line=five array= output=five        # line is `five`; not seen before, so array is blank (same as 0, not true): print the line
line=three array=1 output=          # line is `three`; seen before, so array is 1 (true): do not print the line
line=six array= output=six          # line is `six`; not seen before, so array is blank (same as 0, not true): print the line

So the second occurrences of two and three are not printed.

Using the original expression on the data gives only the unique values:

awk '!_[$0]++' file
one
two
three
four
five
six

To get only the repeated lines (every occurrence after the first):

awk '_[$0]++' file
two
three
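Note that this pattern prints a line every time it repeats, so a value occurring three times shows up twice. If each duplicated value should be listed once, one possible variation is to print only when the old count is exactly 1. This is a sketch on invented input, assuming a POSIX shell; the array name seen and the shell variable names are illustrative.

```shell
# Hypothetical input in which "two" appears three times.
dups_all=$(printf 'one\ntwo\ntwo\ntwo\n' | awk 'seen[$0]++')

# Print only on the second occurrence, i.e. when the old count is 1.
dups_once=$(printf 'one\ntwo\ntwo\ntwo\n' | awk 'seen[$0]++ == 1')

printf 'every repeat:\n%s\n' "$dups_all"
printf 'once per duplicated value:\n%s\n' "$dups_once"
```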
