
I have a record.txt file, and the file contains the following:

Danna,20,female,City,80.00
Danna,20,female,City,90.00
Jason,22,male,City,90.00
Jason,22,male,City,80.00
Kelly,20,female,City,89.00
Timmy,21,male,City,88.00
Tom,22,male,City,90.00

I only want to check for duplicate records based on the first to fourth columns and print them.

Sample output:

Danna,20,female,City,80.00
Danna,20,female,City,90.00
Jason,22,male,City,90.00
Jason,22,male,City,80.00

I've tried this code to get only the duplicates in columns 1 to 4, but I don't know how to print them together with column 5:

awk -F"," '{print $1","$2","$3","$4}' record.txt | sort | uniq -D

What I get is

Danna,20,female,City
Danna,20,female,City
Jason,22,male,City
Jason,22,male,City

What I need to get is this

Danna,20,female,City,80.00
Danna,20,female,City,90.00
Jason,22,male,City,90.00
Jason,22,male,City,80.00
Kisses
  • What did you try and how did it fail? – Inian Sep 01 '23 at 07:13
  • SO is not meant to code for you, but to solve specific coding problems. Try on your side. Once you meet a specific problem, come again and we'll happily help you to solve it. – Itération 122442 Sep 01 '23 at 07:26
  • I've edited my question – Kisses Sep 01 '23 at 07:33
  • @Kisses, would there be only 5 fields in the file? – RomanPerekhrest Sep 01 '23 at 07:42
  • Yes. There are only 5 fields in the file – Kisses Sep 01 '23 at 07:46
  • Looping over lines is trivial, as is getting 4 values from columns. You could concatenate all 4 columns to 1 string and check in an awk array if already seen. Ex: make a string of `Danna,20,female,City`. Your issue will of course be "how to get first value", since it isn't duplicated first time you see it. Maybe loop over lines twice, like [described here](https://stackoverflow.com/questions/28544105/awk-go-through-the-file-twice-doing-different-tasks) – MyICQ Sep 01 '23 at 07:47

6 Answers


You can delve into the SUBSEP variable of awk and use fields 1-4 as the index of an array to check for duplicates with the i in array expression. For example:

awk -F, '($1,$2,$3,$4) in a {if(a[$1,$2,$3,$4]==1) print rec[$1,$2,$3,$4]; print} {a[$1,$2,$3,$4]++; rec[$1,$2,$3,$4]=$0}' file

Where above, a[] is the array and the , in the subscript is shorthand for the SUBSEP variable. The parentheses in ($1,$2,$3,$4) in a force evaluation of the combined indexes. The ==1 check outputs the original record saved in rec[] together with the duplicate when the first duplicate is found; thereafter only duplicates are output.

Example Use/Output

With your example data in file you would have:

awk -F, '($1,$2,$3,$4) in a {if(a[$1,$2,$3,$4]==1) print rec[$1,$2,$3,$4]; print} {a[$1,$2,$3,$4]++; rec[$1,$2,$3,$4]=$0}' file
Danna,20,female,City,80.00
Danna,20,female,City,90.00
Jason,22,male,City,90.00
Jason,22,male,City,80.00

There are many, many ways you can skin this cat in awk. You can also use the !(($1,$2,$3,$4) in a) expression to find records that are not duplicated, with a little rearrangement.
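For example, a minimal sketch of that idea (an assumption about the rearrangement, not code from the answer above): save each record the first time its key is seen, blank the entry on any repeat, and print whatever is left in the END block, so only records that were never duplicated remain (output order is not guaranteed):

awk -F, '!(($1,$2,$3,$4) in a) { a[$1,$2,$3,$4] = $0; next }   # first time key is seen: save the record
  { a[$1,$2,$3,$4] = "" }                                      # key seen again: blank the saved record
  END { for (k in a) if (a[k] != "") print a[k] }' file        # print only never-duplicated records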

David C. Rankin

You could keep track of the first occurrence and check for a duplicate. If it is a duplicate, then print the first occurrence and the current line.

awk -F, '{
  key = $1 FS $2 FS $3 FS $4
  if (a[key]++) {
    if (key in b) { print b[key]; delete b[key] }
    print $0
  } else { 
    b[key] = $0 
  }
}' file

Output

Danna,20,female,City,80.00
Danna,20,female,City,90.00
Jason,22,male,City,90.00
Jason,22,male,City,80.00
The fourth bird
$ cat tst.awk
BEGIN { FS=OFS="," }
{
    key = $1 FS $2 FS $3 FS $4

    if ( key in first ) {
        print first[key] $0
        first[key] = ""
    }
    else {
        first[key] = $0 ORS
    }
}

$ awk -f tst.awk record.txt
Danna,20,female,City,80.00
Danna,20,female,City,90.00
Jason,22,male,City,90.00
Jason,22,male,City,80.00
Ed Morton

Using awk:

script.awk

BEGIN{FS=OFS=","}

{
    k=$1 OFS $2 OFS $3 OFS $4
    a[k] = k in a ? a[k] OFS $5 : $5
}

END{for(key in a) if(split(a[key], b, OFS)>1) for (i in b) print key, b[i]}
  • BEGIN{FS=OFS=","} set field separator FS and output field separator OFS to ,
  • k=$1 OFS $2 OFS $3 OFS $4 create associative key for array next line
  • a[k] = k in a ? a[k] OFS $5 : $5 array a with key k take value a[k] OFS $5 if the key k is already in a, otherwise take value $5
  • END{} at the end of the last file
  • for(key in a) for each value in a
  • if(split(a[key], b, OFS)>1) if the returned value of split is more than 1
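Assuming the input file is named record.txt as in the question (the answer itself does not name it), an example run would look like the following; note that the for (key in a) loop does not guarantee the order of the two groups:

awk -f script.awk record.txt
Danna,20,female,City,80.00
Danna,20,female,City,90.00
Jason,22,male,City,90.00
Jason,22,male,City,80.00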
Marius_Couet

With a GNU awk multidimensional array (array of arrays), using the first 4 fields as the key and the last field as the value:

awk '{ last=$NF; $NF=""; size=length(a[$0]); a[$0][size++]=last }
     END{ for (i in a) { 
             if (length(a[i]) == 1) continue; # skip unique lines
             for(j in a[i]) print i""a[i][j]
          }
     }' FS=, OFS=, test.txt

Jason,22,male,City,90.00
Jason,22,male,City,80.00
Danna,20,female,City,80.00
Danna,20,female,City,90.00
RomanPerekhrest
  • I don't doubt your solution, but would that not force the whole file (first 4 fields) to be held in memory, including unique lines? Would it be beneficial to store just the duplicates and run over the file twice? Just curious – MyICQ Sep 01 '23 at 08:37
  • @MyICQ, "to store just duplicates" - how can you know whether a record is duplicated without scanning to the end if, say, one record is the 1st line and its duplicate is the last line? – RomanPerekhrest Sep 01 '23 at 10:55
  • Hard to know without running over the file multiple times. – MyICQ Sep 01 '23 at 11:11

Here is a Ruby solution:

ruby -lne 'BEGIN{lines=Hash.new { |h, k| h[k] = [] }}
lines[$_.split(/,/)[0..3]]<<$_
END{lines.each{|k,v| puts v if v.length>1}}' file  

You can also do a two-pass awk so that the entire file does not need to be held in memory (only the unique keys would be in memory):

awk 'BEGIN{FS=OFS=","}
{key=$1 FS $2 FS $3 FS $4}
FNR==NR {cnt[key]++; next}
cnt[key]>1' file file

Or use sorted input if you want duplicate lines that are not adjacent in the file to be printed together:

awk 'BEGIN{FS=OFS=","}
{key=$1 FS $2 FS $3 FS $4}
FNR==NR {cnt[key]++; next}
cnt[key]>1' file <(sort -t, -s -k 1,4 file) 

Or if you are short on memory, use the Unix tools (which are more optimized for memory use) to pre-select duplicates and use awk to print them:

awk 'BEGIN{FS=OFS=","}
{key=$1 FS $2 FS $3 FS $4}
FNR==NR{seen[key]; next}
key in seen
' <(cut -d , -f 1-4 file | uniq -d) file

Alternatively, you can use grep with fixed strings to find the duplicates:

grep -F -f <(cut -d , -f 1-4 file | uniq -d ) file 

Any of those print:

Danna,20,female,City,80.00
Danna,20,female,City,90.00
Jason,22,male,City,90.00
Jason,22,male,City,80.00
dawg