
I am trying to get the unique lines in a file with multiple columns.

My file "file.txt" contains the sample records below:

20230830,52678,004,Apple,21
20230830,52678,004,Apple,20
20230830,52678,004,Apple,19
20230831,47689,001,Orange,15
20230901,47620,002,Grape,29

My desired output is to print only the lines that are unique in columns 1 to 4, regardless of the value in column 5:

20230831,47689,001,Orange,15
20230901,47620,002,Grape,29

I tried using sed to add a unique separator between columns 1-4 and column 5

And then I use an awk command to get the unique lines from columns 1-4:

sed 's/,/|/4' file.txt | awk -F"|" '{arr[$1]++} END{for(i in arr) if(arr[i]==1) print $0}'

With this code, it works with a small set of data, but when I use it on a file with 1000 lines, I get...

20230831,47689,001,Orange,15
20230831,47689,001,Orange,15
20230831,47689,001,Orange,15
20230831,47689,001,Orange,15
...

The unique values keep on coming; they are duplicating. It seems like I'm only getting one unique line and it keeps duplicating.

Can you help me figure out if there's something wrong with my code?

I am expecting to print only the unique lines, like this:

20230831,47689,001,Orange,15
20230901,47620,002,Grape,29
Candy

5 Answers


There's no need to use sed to convert the field delimiter from , to | since awk is able to parse the file on ,.

One awk idea:

awk  '
BEGIN { FS=OFS="," }
      { key = $1 OFS $2 OFS $3 OFS $4
        lines[key]=$5
        counts[key]++
      }
END   { for (i in counts) 
            if (counts[i]==1)                # unique if count == 1
               print i,lines[i]
      }
' file.txt

This generates:

20230901,47620,002,Grape,29
20230831,47689,001,Orange,15

NOTE: the order in which array indices are processed is not guaranteed; if the output must be sorted in a specific order, we could add more code
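
For example, assuming a plain lexical sort of the output is acceptable, one option is to pipe the result through sort:

awk 'BEGIN {FS=OFS=","} {key = $1 OFS $2 OFS $3 OFS $4; lines[key]=$5; counts[key]++} END {for (i in counts) if (counts[i]==1) print i, lines[i]}' file.txt | sort -t, -k1,1

Here sort -t, -k1,1 orders the output by the first (date) field; adjust the sort keys if a different ordering is required.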


re: OP's comment (I just need to make it in one line), a couple of options come to mind:

Jam current code into one line, eg:

awk 'BEGIN {FS=OFS=","} {key = $1 OFS $2 OFS $3 OFS $4;lines[key]=$5;counts[key]++} END {for (i in counts) if (counts[i]==1) print i,lines[i]}' file.txt

Place the awk code into a separate file (eg, key.awk) then reference the file in the awk invocation, eg:

$ cat key.awk
BEGIN { FS=OFS="," }
      { key = $1 OFS $2 OFS $3 OFS $4
        lines[key]=$5
        counts[key]++
      }
END   { for (i in counts) 
            if (counts[i]==1)
               print i,lines[i]
      }

$ awk -f key.awk file.txt
20230901,47620,002,Grape,29
20230831,47689,001,Orange,15
markp-fuso
  • It works. Thank you so much. I just need to make it in one line. – Candy Sep 01 '23 at 15:11
  • Would there be a little change in the your code if I want to reverse the logic? Say I want to get the duplicates only now. – Candy Sep 01 '23 at 15:41
  • change `if (counts[i]==1)` to `if (counts[i]>1)` – markp-fuso Sep 01 '23 at 15:51
  • I did try that one, but it only prints one duplicate line for each group. And now I'm stuck – Candy Sep 01 '23 at 15:57
  • I want to print all duplicate lines – Candy Sep 01 '23 at 15:58
  • ah, yeah, that's right ... need to store multiple rows; at this point you're modifying the requirements of the original question; [chameleon questions](https://meta.stackexchange.com/questions/43478) are frowned upon; the suggested approach is to take the answer(s) you've received so far, see if you can modify the answer(s) to address the new requirement and if you run into issues then ask a new question (see the multi-row sketch after these comments) – markp-fuso Sep 01 '23 at 16:03
  • @Candy a recent [Q&A](https://stackoverflow.com/q/77021167) that may be of interest – markp-fuso Sep 01 '23 at 17:27
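
A minimal sketch of that multi-row approach (an illustration building on the answer above, not code from the original thread): collect every line per key, then print only the keys seen more than once.

awk '
BEGIN { FS="," }
      { key = $1 FS $2 FS $3 FS $4
        lines[key] = (key in lines ? lines[key] ORS : "") $0    # append each full line seen for this key
        counts[key]++
      }
END   { for (i in counts)
            if (counts[i] > 1)           # duplicated keys only
               print lines[i]
      }
' file.txt

With the sample data this prints the three Apple lines.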

This Unix pipeline will identify lines in your file that are duplicated based on the first 4 fields (note that uniq -d only reports adjacent duplicates, which is sufficient as long as lines sharing a key are adjacent, as in the sample data):

$ cut -d, -f 1-4 file | uniq -d
20230830,52678,004,Apple

You can then use grep to invert that match so the duplicates are skipped:

$ grep -vF -f <(cut -d, -f 1-4 file | uniq -d) file  

Prints:

20230831,47689,001,Orange,15
20230901,47620,002,Grape,29

The advantage here is a lower memory requirement, since you do not have to hold the whole file in memory.

dawg

Awk alone can solve your problem:

kent$ awk -F, '{k=$1 FS $2 FS $3 FS $4; a[k]++; b[k]=$0}
               END{for(x in a) if(a[x]==1)print b[x]}' file
20230901,47620,002,Grape,29
20230831,47689,001,Orange,15
Kent
$ awk -F'[^,]*$' 'FNR==NR{a[$1]++; next} a[$1]==1' inputfile inputfile
20230831,47689,001,Orange,15
20230901,47620,002,Grape,29
ufopilot
  • I like the concept of processing the file twice. Except the key is the concatenation of the first 4 fields. – glenn jackman Sep 02 '23 at 13:57
  • @glennjackman ufopilot is using the whole 5th field as the FS so `$1` in the script **is** the concatenation of the first 4 `,`-separated fields. Cute. – Ed Morton Sep 03 '23 at 00:51
  • @ufopilot `FNR==NR{a[$1]; next} $1 in a` would be more efficient in both memory usage and speed of execution than `FNR==NR{a[$1]++; next} a[$1]==1` – Ed Morton Sep 03 '23 at 00:52

something wrong with my code?

$0 used inside the END action denotes the last line read, therefore

awk 'END{print $0}' file.txt

will give the same output as

tail --lines=1 file.txt
Daweo
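
For completeness, a minimal sketch of how the OP's original sed + awk pipeline could be repaired (an assumption, not taken from any of the answers above): save the 5th field per key and print the stored key instead of $0.

sed 's/,/|/4' file.txt | awk -F'|' '{arr[$1]++; val[$1]=$2} END{for(i in arr) if(arr[i]==1) print i "," val[i]}'

Here $1 is everything before the | (columns 1-4) and $2 is column 5; rejoining them with a comma reproduces the original line, and the output order is again not guaranteed.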