
I am trying to get the unique lines in a file with multiple columns.

My file "file.txt" contains the sample records below:

20230830,52678,004,Apple,21
20230830,52678,004,Apple,20
20230830,52678,004,Apple,19
20230831,47689,001,Orange,15
20230901,47620,002,Grape,29

My desired output is to print only the lines that are unique in columns 1 to 4, regardless of the value in column 5:

20230831,47689,001,Orange,15
20230901,47620,002,Grape,29

I tried using sed to add a unique separator between columns 1-4 and column 5

And then I use an awk command to get the unique lines from columns 1-4:

sed 's/,/|/4' file.txt | awk -F"|" '{arr[$1]++} END{for(i in arr) if(arr[i]==1) print $0}'

With this code, it works with a small set of data, but when I use it on a file with 1000 lines, I get...

20230831,47689,001,Orange,15
20230831,47689,001,Orange,15
20230831,47689,001,Orange,15
20230831,47689,001,Orange,15
...

The unique values keep on coming; they are duplicating. It seems like I'm only getting one unique line and it keeps duplicating.

Can you help me figure out if there's something wrong with my code?

I am expecting to print only the unique lines, like this:

20230831,47689,001,Orange,15
20230901,47620,002,Grape,29
Candy

5 Answers


There's no need to use sed to convert the field delimiter from , to | since awk is able to parse the file on ,.

One awk idea:

awk  '
BEGIN { FS=OFS="," }
      { key = $1 OFS $2 OFS $3 OFS $4
        lines[key]=$5
        counts[key]++
      }
END   { for (i in counts) 
            if (counts[i]==1)                # unique if count == 1
               print i,lines[i]
      }
' file.txt

This generates:

20230901,47620,002,Grape,29
20230831,47689,001,Orange,15

NOTE: the order in which array indices are processed is not guaranteed; if the output must be sorted in a specific order, we could add more code
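
For example, assuming a plain lexical sort of the output is acceptable, one option is to pipe the result through sort:

awk 'BEGIN {FS=OFS=","} {key = $1 OFS $2 OFS $3 OFS $4; lines[key]=$5; counts[key]++} END {for (i in counts) if (counts[i]==1) print i, lines[i]}' file.txt | sort -t, -k1,1

Here sort -t, -k1,1 orders the output by the first (date) field; adjust the sort keys if a different ordering is required.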


re: OP's comment (I just need to make it in one line), a couple of options come to mind:

Jam current code into one line, eg:

awk 'BEGIN {FS=OFS=","} {key = $1 OFS $2 OFS $3 OFS $4;lines[key]=$5;counts[key]++} END {for (i in counts) if (counts[i]==1) print i,lines[i]}' file.txt

Place the awk code into a separate file (eg, key.awk) then reference the file in the awk invocation, eg:

$ cat key.awk
BEGIN { FS=OFS="," }
      { key = $1 OFS $2 OFS $3 OFS $4
        lines[key]=$5
        counts[key]++
      }
END   { for (i in counts) 
            if (counts[i]==1)
               print i,lines[i]
      }

$ awk -f key.awk file.txt
20230901,47620,002,Grape,29
20230831,47689,001,Orange,15
markp-fuso
  • It works. Thank you so much. I just need to make it in one line. – Candy Sep 01 '23 at 15:11
  • Would there be a little change in the your code if I want to reverse the logic? Say I want to get the duplicates only now. – Candy Sep 01 '23 at 15:41
  • change `if (counts[i]==1)` to `if (counts[i]>1)` – markp-fuso Sep 01 '23 at 15:51
  • I did try that one, but it only prints one duplicate line for each group. And now I'm stuck – Candy Sep 01 '23 at 15:57
  • I want to print all duplicate lines – Candy Sep 01 '23 at 15:58
  • ah, yeah, that's right ... need to store multiple rows; at this point you're modifying the requirements of the original question; [chameleon questions](https://meta.stackexchange.com/questions/43478) are frowned upon; the suggested approach is to take the answer(s) you've received so far, see if you can modify the answer(s) to address the new requirement and if you run into issues then ask a new question (see the multi-row sketch after these comments) – markp-fuso Sep 01 '23 at 16:03
  • @Candy a recent [Q&A](https://stackoverflow.com/q/77021167) that may be of interest – markp-fuso Sep 01 '23 at 17:27
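
A minimal sketch of that multi-row approach (an illustration building on the answer above, not code from the original thread): collect every line per key, then print only the keys seen more than once.

awk '
BEGIN { FS="," }
      { key = $1 FS $2 FS $3 FS $4
        lines[key] = (key in lines ? lines[key] ORS : "") $0    # append each full line seen for this key
        counts[key]++
      }
END   { for (i in counts)
            if (counts[i] > 1)           # duplicated keys only
               print lines[i]
      }
' file.txt

With the sample data this prints the three Apple lines.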

This Unix pipeline will identify lines in your file that are duplicated based on the first 4 fields (note that uniq -d only reports adjacent duplicates, which is sufficient as long as lines sharing a key are adjacent, as in the sample data):

$ cut -d, -f 1-4 file | uniq -d
20230830,52678,004,Apple

You can then use grep to invert that match so the duplicates are skipped:

$ grep -vF -f <(cut -d, -f 1-4 file | uniq -d) file  

Prints:

20230831,47689,001,Orange,15
20230901,47620,002,Grape,29

The advantage here is a lower memory requirement, since you do not have to hold the whole file in memory.

dawg

Awk alone can solve your problem:

kent$ awk -F, '{k=$1 FS $2 FS $3 FS $4; a[k]++; b[k]=$0}
               END{for(x in a) if(a[x]==1)print b[x]}' file
20230901,47620,002,Grape,29
20230831,47689,001,Orange,15
Kent
$ awk -F'[^,]*$' 'FNR==NR{a[$1]++; next} a[$1]==1' inputfile inputfile
20230831,47689,001,Orange,15
20230901,47620,002,Grape,29
ufopilot
  • I like the concept of processing the file twice. Except the key is the concatenation of the first 4 fields. – glenn jackman Sep 02 '23 at 13:57
  • @glennjackman ufopilot is using the whole 5th field as the FS so `$1` in the script **is** the concatenation of the first 4 `,`-separated fields. Cute. – Ed Morton Sep 03 '23 at 00:51
  • @ufopilot `FNR==NR{a[$1]; next} $1 in a` would be more efficient in both memory usage and speed of execution than `FNR==NR{a[$1]++; next} a[$1]==1` – Ed Morton Sep 03 '23 at 00:52

something wrong with my code?

$0 used inside the END action denotes the last line read, therefore

awk 'END{print $0}' file.txt

will give the same output as

tail --lines=1 file.txt
Daweo
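
For completeness, a minimal sketch of how the OP's original sed + awk pipeline could be repaired (an assumption, not taken from any of the answers above): save the 5th field per key and print the stored key instead of $0.

sed 's/,/|/4' file.txt | awk -F'|' '{arr[$1]++; val[$1]=$2} END{for(i in arr) if(arr[i]==1) print i "," val[i]}'

Here $1 is everything before the | (columns 1-4) and $2 is column 5; rejoining them with a comma reproduces the original line, and the output order is again not guaranteed.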