Multiple options of nf for identify duplicate in different positions awk?

Question

I hope you find yourself well, I am writing to know if it is possible to do something like this in awk

I NEED SOMETHING LIKE MANY CASE OF NF... FOR NF = 7 PK IS $1,$5, BUT FOR NF=8 $1,$6

INPUT

AAA|BBB|CCC|DDD|111|20220129|JONH1
AAA|XXX|YYY|DDD|444|20210115|JONH2
AAA|B10|CCC|DDD|000|20200127|JONH3
AAA|BBB|MMM|DDD|444|20200131|JONH4
AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL

DESIRE OUTPUTS

file .PK_OK_1

AAA|BBB|CCC|DDD|111|20220129|JONH1
AAA|B10|CCC|DDD|000|20200127|JONH3

file DUPLICATE_PK_1

AAA|XXX|YYY|DDD|444|20210115|JONH2
AAA|BBB|MMM|DDD|444|20200131|JONH4

file PK_OK_2

AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY

file DUPLICATE_PK_2

AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY

file INVALID_LENGHT

AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL

MY CODE IS something like this (NOM_ARCH IS A VARIABLE)

BEGIN { FS="|";
        OFS="|"
          }


NF == 7 {
            if (!seen[$1,$5]) {
                print > NOM_ARCH".PK_OK_1"; seen[$1,$5]=1
             }else{
                 print > NOM_ARCH".DUPLICATE_PK_1"
                }
          next 
          }
NF == 8 {
            if (!seen[$1,$6]) {
                print > NOM_ARCH".PK_OK_2"; seen[$1,$6]=1
             }else{
                 print > NOM_ARCH".DUPLICATE_PK_2"
                }
          next 
          }
{ print > NOM_ARCH".INVALID_LENGHT" }

Welcome to SO, kudos for adding your code in your question. Could you please elaborate on why first 4 lines `AAA|BBB|CCC|DDD|111|20220129|JONH1 AAA|XXX|YYY|DDD|444|20210115|JONH2 AAA|B10|CCC|DDD|000|20200127|JONH3 AAA|BBB|MMM|DDD|444|20200131|JONH4` are in different files? As they have same NF value. — RavinderSingh13, Jun 18 '21 at 14:18
Because after passing the validation of the NF it has to validate if they are duplicates or not and they are taken to different files depending on whether or not it has duplicate pk — dev_vbf, Jun 18 '21 at 14:29
if change de NF .. change the validation of pk.. FOR NF = 7 PK IS $1,$5, BUT FOR NF=8 $1,$6 — dev_vbf, Jun 18 '21 at 14:33
Could you please do mention by example what are the field values you want to take to check, eg: for NF=7 use `AAA` and `DDD` etc as values to check for duplicates, that will make it more clear. — RavinderSingh13, Jun 18 '21 at 14:40
Please don't shout (all those capitals!). When you say `NOM_ARCH IS A VARIABLE` do you mean it's a shell variable or an awk variable? — Ed Morton, Jun 18 '21 at 14:57
is a variable in shell that receives the awk awk -F '' -v NOM_ARCH=${VAR_INPUT} — dev_vbf, Jun 18 '21 at 15:11
Please don't use all-upper-case names for shell or awk variables to avoid clashes with builtin variable names and to make your code clearer. See [correct-bash-and-shell-script-variable-capitalization](https://stackoverflow.com/questions/673055/correct-bash-and-shell-script-variable-capitalization) for more info on shell variable naming conventions. — Ed Morton, Jun 18 '21 at 16:12

RavinderSingh13 · Answer 1 · 2021-06-18T15:32:51.073

With your shown samples, please try following awk code.

awk '
BEGIN{ FS=OFS="|" }
{
  if(NF==7){ key=($1 FS $5) }
  if(NF==8){ key=($1 FS $6) }
}
FNR==NR{
  arr1[key]++
  next
}
NF==7{
  outputFile=(arr1[key]==1?"file.PK_OK_1":"file_DUPLICATE_PK_1")
}
NF==8{
  outputFile=(arr1[key]==1?"file.PK_OK_2":"file_DUPLICATE_PK_2")
}
NF>8{
  outputFile="file_INVALID_LENGHTH"
}
{
  print > (outputFile)
}
' Input_file  Input_file

OR use following code without ternary operators as per OP's request:

awk '
BEGIN{ FS=OFS="|" }
{
  if(NF==7){ key=($1 FS $5) }
  if(NF==8){ key=($1 FS $6) }
}
FNR==NR{
  arr1[key]++
  next
}
NF==7{
  if(arr1[key]==1){ outputFile="file.PK_OK_1"       }
  else            { outputFile="file_DUPLICATE_PK_1"}
}
NF==8{
  if(arr1[key]==1){ outputFile="file.PK_OK_2"       }
  else            { outputFile="file_DUPLICATE_PK_2"} 
}
NF>8{
  outputFile="file_INVALID_LENGHTH"
}
{
  print > (outputFile)
}
' Input_file  Input_file

Explanation: Adding detailed explanation for above.

## Starting awk program from here.
awk '
## Starting BEGIN section of this program from here, setting FS and OFS to | here.
BEGIN{ FS=OFS="|" }
##Starting main program from here.
{
##Checking condition if NF is 7 then set key to $1 FS $5.
  if(NF==7){ key=($1 FS $5) }
##Checking condition if NF is 8 then set key to $1 FS $6.
  if(NF==8){ key=($1 FS $6) }
}
##Checking condition FNR==NR which will be TRUE when 1st time Input_file is being read.
FNR==NR{
##Creating array arr1 with index of key and keep increasing same key value with 1 here.
  arr1[key]++
##next will skip all further statements from here.
  next
}
##Checking condition if NF==7 then do following.
NF==7{
##Setting outputFile(where contents will be written to), either file.PK_OK_1 OR file_DUPLICATE_PK_1 depending upon value of arr1.
##Basically it uses ternary operators ? and :
##Statements after ? will executed if condition arr1[key]==1 is TRUE.
##Statements after : will be executed if condition ar1[key]==1 is FALSE.
  outputFile=(arr1[key]==1?"file.PK_OK_1":"file_DUPLICATE_PK_1")
}
##Checking condition if NF==8 then do following.
NF==8{
##Setting outputFile(where contents will be written to), either file.PK_OK_2 OR file_DUPLICATE_PK_2 depending upon value of arr1.
  outputFile=(arr1[key]==1?"file.PK_OK_2":"file_DUPLICATE_PK_2")
}
##Checking condition if NF>8 then do following.
NF>8{
##Setting outputFile(where contents will be written to) to file_INVALID_LENGHTH here.
  outputFile="file_INVALID_LENGHTH"
}
{
##Printing current line to outputFile(already set its value above)
  print > (outputFile)
}
##Mentioning Input_file names here.
' Input_file  Input_file

sorry but why is the input passed to awk twice? I am trying to optimize a process of 10,000,000 records, will that not retry the process? — dev_vbf, Jun 18 '21 at 15:14
@dev_vbf, I was waiting for your response, will add detailed explanation in a min or so. 2 times file is processed to get count of values(1st, 5th fields, 1st, 6th fields) and processing actual things in 2nd read of file. — RavinderSingh13, Jun 18 '21 at 15:16
@dev_vbf, `I am trying to optimize a process of 10,000,000 records, will that not retry the process` are you running this `awk` code in a loop or something? — RavinderSingh13, Jun 18 '21 at 15:17
outputFile=(arr1[key]==1?"file.PK_OK_1":"file_DUPLICATE_PK_1") how can i write this more simple like if else with {} ? xD i dont understand much.. — dev_vbf, Jun 18 '21 at 15:20
@dev_vbf, Let me add detailed explanation now for it and let you know here. — RavinderSingh13, Jun 18 '21 at 15:21
@dev_vbf, detailed explanation is added in solution now, please do check and let me know in case of any queries. — RavinderSingh13, Jun 18 '21 at 15:27
How can I write this line without ternary operators, to understand how the ternary operator works ` outputFile=(arr1[key]==1?"file.PK_OK_1":"file_DUPLICATE_PK_1")` — dev_vbf, Jun 18 '21 at 15:30
@dev_vbf, Ok sure, I have added OR solution using simple if/else here. Kindly do check once. — RavinderSingh13, Jun 18 '21 at 15:33
Thank you very much for your help!!! I did not know how to do it .. I also did not know that I could have many 'ends' I will test it in my 10,000,000 file! (It does not have loops, only the awk) — dev_vbf, Jun 18 '21 at 15:39

Ed Morton · Answer 2 · 2021-06-18T16:09:27.280

4

Normally I'd recommend a first pass with sort and uniq -c for efficiency but I started out assuming the wrong requirements and so wrote most of this under that assumption and so I've just tweaked it now for the real requirements and so here's how to do it all in one awk script:

$ cat tst.awk
BEGIN {
    FS=OFS="|"
    map[7] = 1
    map[8] = 2
}
{ key = $1 FS $(NF-2) FS NF }
NR==FNR {
    cnt[key]++
    next
}
{
    if ( NF in map ) {
        sfx = ( cnt[key]>1 ? "DUPLICATE_PK" : "PK_OK" ) "_" map[NF]
    }
    else {
        sfx = "INVALID_LENGTH"
    }
    print > (nom_arch "." sfx)
}

$ awk -v nom_arch='foo' -f tst.awk file file

$ head foo.*
==> foo.DUPLICATE_PK_1 <==
AAA|XXX|YYY|DDD|444|20210115|JONH2
AAA|BBB|MMM|DDD|444|20200131|JONH4

==> foo.DUPLICATE_PK_2 <==
AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY

==> foo.INVALID_LENGTH <==
AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL

==> foo.PK_OK_1 <==
AAA|BBB|CCC|DDD|111|20220129|JONH1
AAA|B10|CCC|DDD|000|20200127|JONH3

==> foo.PK_OK_2 <==
AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY

I corrected the spelling of LENGTH above.

Note that NF is included in key = $1 FS $(NF-2) FS NF so we avoid a potential case pointed out by @rowboat where a line with 7 fields has the same $1 and $(NF-2) as a line with 8 fields and so we would otherwise end up counting that twice when it should be 2 separate counts of 1.

We could have used NF-6 instead of map[NF] when setting the sfx but the map[] is useful for identifying valid NF values too and there may be other values of NF in future for which the sfx can't be determined by just subtracting 6.

edited Jun 18 '21 at 16:09

answered Jun 18 '21 at 14:56

Ed Morton

188,023
17
78
185

why key is $1 FS $(NF-2) ... what that mean ? – dev_vbf Jun 18 '21 at 15:46
1

It means I'm creating an index for the `cnt[]` array using the 2 fields you care about from each valid line of input - the first field and the 3rd-last field. – Ed Morton Jun 18 '21 at 15:48
map[7] = 1 map[8] = 2 and the 1 and 2 are positions? is a complex code but works! – dev_vbf Jun 18 '21 at 15:50
2

What's complex about it? For NF of 7 you want an output file with the suffix 1, and for NF of 8 you want an output file with the suffix 2 so I'm just using an array to define that mapping so later in the code you can get the suffix by using `sfx=map[NF]` instead of `if (NF==7) sfx=1; else if (NF==8) sfx=2`. It's **different** from your approach but it's actually much simpler, not more complex, and requires less code and less redundancy of code. Just take a few minutes to think about what each step does - it's all very straight forward. – Ed Morton Jun 18 '21 at 15:53
and if, for example, in the future I would like not to have only 2 pks, to have more pks like 5 ... how could I make that change? and another NF Example 7,8 and 10 .. just to imagine it .. – dev_vbf Jun 18 '21 at 15:58
@dev_vbf Whatever it is you need to do will be trivial if you just follow the same structure of code as in my answer but [chameleon questions](https://meta.stackexchange.com/questions/43478/exit-strategies-for-chameleon-questions) are strongly discouraged on this forum, ask a new followup question if you have one. – Ed Morton Jun 18 '21 at 15:59

glenn jackman · Accepted Answer · 2021-06-18T17:24:18.247

2

This uses GNU awk for multidimensional arrays:

# classify.awk
BEGIN {
    FS = "|"
    ok[7] = ".PK_OK_1"; dup[7] = ".DUPLICATE_PK_1"
    ok[8] = ".PK_OK_2"; dup[8] = ".DUPLICATE_PK_2"
}
NF < 7 || NF > 8 {
    print > nom_arch".INVALID_LENGTH"
    next
}
{
    pk = $1 SUBSEP (NF == 7 ? $5 : $6)
    count[NF][pk]++
    lines[NF][pk] = lines[NF][pk] $0 ORS
}
END {
    for (nf in count)
        for (pk in count[nf]) {
            outfile = nom_arch (count[nf][pk] == 1 ? ok[nf] : dup[nf])
            sub(ORS"$", "", lines[nf][pk])
            print lines[nf][pk] > outfile
        }
}

Then this will produce the desired output files

gawk -f classify.awk -v nom_arch="foo" file

The awk SUBSEP variable is used in array keys when you do something like

var[x,y] = 10

awk uses the value of SUBSEP to join the values of x and y.

The default SUBSEP value is octal value 034, an ASCII character unlikely to appear in text data.

This version is more portable, does not require GNU awk

BEGIN {
    FS = "|"
    ok[7] = ".PK_OK_1"; dup[7] = ".DUPLICATE_PK_1"
    ok[8] = ".PK_OK_2"; dup[8] = ".DUPLICATE_PK_2"
}
NF < 7 || NF > 8 {
    print > (nom_arch".INVALID_LENGTH")
    next
}
{
    pk = NF SUBSEP $1 SUBSEP (NF == 7 ? $5 : $6)
    count[pk]++
    lines[pk] = lines[pk] $0 ORS
}
END {
    for (pk in count) {
        sub(ORS"$", "", lines[pk])
        nf = pk; sub(SUBSEP".*", "", nf)
        outfile = nom_arch (count[pk] == 1 ? ok[nf] : dup[nf]) 
        print lines[pk] > outfile
    }
}

edited Jun 18 '21 at 17:24

answered Jun 18 '21 at 16:22

glenn jackman

238,783
38
220
352

@glenn jackman It is my first time with this type of structures in awk .. so I can not understand where in the code the pks to be used are being specified – dev_vbf Jun 18 '21 at 16:58
I added a note about SUBSEP. This bit `pk = $1 SUBSEP (NF == 7 ? $5 : $6)` takes the first field, and joins it with $5 (if NF == 7) else $6, and assigns to the `pk` primary key variable. – glenn jackman Jun 18 '21 at 17:25
i try change the operator ternary to { } and if and not works .. do you have and example to change `outfile = nom_arch (count[pk] == 1 ? ok[nf] : dup[nf])` --> ( is bad i know , but i try) --> `if(nom_arch (count[pk] == 1){ outfile=ok[nf] }else{ outfile= dup[nf] }` sorry im new in this world – dev_vbf Jun 18 '21 at 17:36
1

To turn the ternary into an if, do: `if (count[pk] == 1) {outfile = nom_arch ok[nf]} else {outfile = nom_arch dup[nf]}` – glenn jackman Jun 18 '21 at 18:17
1

Maybe you should go spend some time learning awk instead of apologizing. Click [tag:awk], then click the "Learn more" link. – glenn jackman Jun 18 '21 at 18:18
Thanks Glenn, if I had read enough but never seen something like the code you sent, I will try to understand it and replicate it in other projects to learn more. thank you again for your help – dev_vbf Jun 18 '21 at 18:37

Paul Hodges · Answer 4 · 2021-06-18T20:28:17.187

If it's ok to put the first occurrence of a dup in with the OK's, then one pass is easy.

NOM_ARCH=/tmp/mytest
awk -v nom_arch="$NOM_ARCH" ' BEGIN { FS=OFS="|" }
{ if (NF ~ /^[78]$/) { key=($1 FS NF-2) } else { print > (nom_arch ".INVALID_LENGTH"); next; }
  print > ( nom_arch "." ( seen[key]++ ? "DUPLICATE_PK" : "PK_OK" ) "_" NF-6 ) } ' file

c.f. AAA|B10|CCC|DDD|000|20200127|JONH3 and AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY which land in the OK files as the first hit, but subsequent dups get seen and directed elsewhere. Note that it might still be faster to shift those records between smaller files on a second pass after the fact.

Personally, I'd probably just split the records to key-sorted files by NF first. Then the second pass each is easy.

NOM_ARCH=/tmp/mytest
# this pre-sort is likely the slow part, though smaller files and in parallel
awk 'BEGIN { FS=OFS="|" } { k2=NF-2; print | "sort -t\\| -k1,1 -k"k2","k2">NF"NF; }' file
shopt -s extglob; cat NF!([78]) > $NOM_ARCH.INVALID_LENGTH &
for f in NF[78]; do
  awk -v nom_arch="$NOM_ARCH" '
    BEGIN { FS=OFS="|"; lastkey=""; lastrec=""; }
    END   { if(""!=lastrec){print lastrec>f} }
    { key=($1 FS $(NF-2));
      if ( key==lastkey ) {
         f=(nom_arch".DUPLICATE_PK_"NF-6);
         if(""!=lastrec){print lastrec>f}
         print $0>f;
         lastrec="";
      } else {
         if(""!=lastrec){print lastrec>f}
         f=(nom_arch".PK_OK_"NF-6);
         lastkey=($1 FS $(NF-2));
         lastrec=$0;
      }
    }' "$f" &
done
wait

Now your data should be sorted to files. This likely reorders the records in those files (see below), so if that matters you should add sorts to those outputs as well.

mytest.PK_OK_1:

AAA|B10|CCC|DDD|000|20200127|JONH3
AAA|BBB|CCC|DDD|111|20220129|JONH1

mytest.PK_OK_2:

AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY

mytest.DUPLICATE_PK_1:

AAA|BBB|MMM|DDD|444|20200131|JONH4
AAA|XXX|YYY|DDD|444|20210115|JONH2

mytest.DUPLICATE_PK_2:

AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY

mytest.INVALID_LENGTH:

 AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL

This uses more disk space but less memory than an internal lookup table, and is likely a lot slower. YMMV.

Please test on a larger but not *huge* dataset before committing to that logic. — Paul Hodges, Jun 18 '21 at 18:42

Multiple options of nf for identify duplicate in different positions awk?

4 Answers4