
I have a record.txt file, and the file contains the following:

Danna,20,female,City,80.00
Danna,20,female,City,90.00
Jason,22,male,City,90.00
Jason,22,male,City,80.00
Kelly,20,female,City,89.00
Timmy,21,male,City,88.00
Tom,22,male,City,90.00

I only want to check for duplicate records based on the first to fourth columns and print them.

Sample output:

Danna,20,female,City,80.00
Danna,20,female,City,90.00
Jason,22,male,City,90.00
Jason,22,male,City,80.00

I've tried this code to get only the duplicates in columns 1 to 4, but I don't know how to print them together with column 5:

awk -F"," '{print $1","$2","$3","$4}' record.txt | sort | uniq -D

What I get is

Danna,20,female,City
Danna,20,female,City
Jason,22,male,City
Jason,22,male,City

What I need to get is this

Danna,20,female,City,80.00
Danna,20,female,City,90.00
Jason,22,male,City,90.00
Jason,22,male,City,80.00
Kisses
  • What did you try and how did it fail? – Inian Sep 01 '23 at 07:13
  • SO is not meant to code for you, but to solve specific coding problems. Try on your side. Once you meet a specific problem, come again and we'll happily help you to solve it. – Itération 122442 Sep 01 '23 at 07:26
  • I've edited my question – Kisses Sep 01 '23 at 07:33
  • @Kisses, would there be only 5 fields in the file? – RomanPerekhrest Sep 01 '23 at 07:42
  • Yes. There are only 5 fields in the file – Kisses Sep 01 '23 at 07:46
  • Looping over lines is trivial, as is getting 4 values from columns. You could concatenate all 4 columns to 1 string and check in an awk array if already seen. Ex: make a string of `Danna,20,female,City`. Your issue will of course be "how to get first value", since it isn't duplicated first time you see it. Maybe loop over lines twice, like [described here](https://stackoverflow.com/questions/28544105/awk-go-through-the-file-twice-doing-different-tasks) – MyICQ Sep 01 '23 at 07:47

6 Answers


You can delve into the SUBSEP variable of awk and use fields 1-4 as the index of an array to check for duplicates with the i in array expression. For example:

awk -F, '($1,$2,$3,$4) in a {if(a[$1,$2,$3,$4]==1) print rec[$1,$2,$3,$4]; print} {a[$1,$2,$3,$4]++; rec[$1,$2,$3,$4]=$0}' file

Where above, a[] is the array and the , in the subscript is shorthand for the SUBSEP variable. The parentheses in ($1,$2,$3,$4) in a force evaluation of the combined indexes. The ==1 check outputs the original record saved in rec[] together with the duplicate when the first duplicate is found; thereafter only duplicates are output.

Example Use/Output

With your example data in file you would have:

awk -F, '($1,$2,$3,$4) in a {if(a[$1,$2,$3,$4]==1) print rec[$1,$2,$3,$4]; print} {a[$1,$2,$3,$4]++; rec[$1,$2,$3,$4]=$0}' file
Danna,20,female,City,80.00
Danna,20,female,City,90.00
Jason,22,male,City,90.00
Jason,22,male,City,80.00

There are many, many ways you can skin this cat in awk. You can also use the !(($1,$2,$3,$4) in a) expression to find records that are not duplicated, with a little rearrangement.
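For example, a minimal sketch of that idea (an assumption about the rearrangement, not code from the answer above): save each record the first time its key is seen, blank the entry on any repeat, and print whatever is left in the END block, so only records that were never duplicated remain (output order is not guaranteed):

awk -F, '!(($1,$2,$3,$4) in a) { a[$1,$2,$3,$4] = $0; next }   # first time key is seen: save the record
  { a[$1,$2,$3,$4] = "" }                                      # key seen again: blank the saved record
  END { for (k in a) if (a[k] != "") print a[k] }' file        # print only never-duplicated records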

David C. Rankin

You could keep track of the first occurrence and check for a duplicate. If it is a duplicate, then print the first occurrence and the current line.

awk -F, '{
  key = $1 FS $2 FS $3 FS $4
  if (a[key]++) {
    if (key in b) { print b[key]; delete b[key] }
    print $0
  } else { 
    b[key] = $0 
  }
}' file

Output

Danna,20,female,City,80.00
Danna,20,female,City,90.00
Jason,22,male,City,90.00
Jason,22,male,City,80.00
The fourth bird
$ cat tst.awk
BEGIN { FS=OFS="," }
{
    key = $1 FS $2 FS $3 FS $4

    if ( key in first ) {
        print first[key] $0
        first[key] = ""
    }
    else {
        first[key] = $0 ORS
    }
}

$ awk -f tst.awk record.txt
Danna,20,female,City,80.00
Danna,20,female,City,90.00
Jason,22,male,City,90.00
Jason,22,male,City,80.00
Ed Morton

Using awk:

script.awk

BEGIN{FS=OFS=","}

{
    k=$1 OFS $2 OFS $3 OFS $4
    a[k] = k in a ? a[k] OFS $5 : $5
}

END{for(key in a) if(split(a[key], b, OFS)>1) for (i in b) print key, b[i]}
  • BEGIN{FS=OFS=","} set field separator FS and output field separator OFS to ,
  • k=$1 OFS $2 OFS $3 OFS $4 create associative key for array next line
  • a[k] = k in a ? a[k] OFS $5 : $5 array a with key k take value a[k] OFS $5 if the key k is already in a, otherwise take value $5
  • END{} at the end of the last file
  • for(key in a) for each value in a
  • if(split(a[key], b, OFS)>1) if the returned value of split is more than 1
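Assuming the input file is named record.txt as in the question (the answer itself does not name it), an example run would look like the following; note that the for (key in a) loop does not guarantee the order of the two groups:

awk -f script.awk record.txt
Danna,20,female,City,80.00
Danna,20,female,City,90.00
Jason,22,male,City,90.00
Jason,22,male,City,80.00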
Marius_Couet

With a GNU awk multidimensional array (array of arrays), using the first 4 fields as the key and the last field as the value:

awk '{ last=$NF; $NF=""; size=length(a[$0]); a[$0][size++]=last }
     END{ for (i in a) { 
             if (length(a[i]) == 1) continue; # skip unique lines
             for(j in a[i]) print i""a[i][j]
          }
     }' FS=, OFS=, test.txt

Jason,22,male,City,90.00
Jason,22,male,City,80.00
Danna,20,female,City,80.00
Danna,20,female,City,90.00
RomanPerekhrest
  • I don't doubt your solution, but would that not force the whole file (first 4 fields) to be held in memory, including unique lines? Would it be beneficial to store just the duplicates and run over the file twice? Just curious – MyICQ Sep 01 '23 at 08:37
  • @MyICQ, "to store just duplicates" - how can you know whether a record is duplicated without scanning to the end if, say, one record is the 1st line and its duplicate is the last line? – RomanPerekhrest Sep 01 '23 at 10:55
  • Hard to know without running over the file multiple times. – MyICQ Sep 01 '23 at 11:11

Here is a Ruby solution:

ruby -lne 'BEGIN{lines=Hash.new { |h, k| h[k] = [] }}
lines[$_.split(/,/)[0..3]]<<$_
END{lines.each{|k,v| puts v if v.length>1}}' file  

You can also do a two-pass awk so that the entire file does not need to be held in memory (only the unique keys would be in memory):

awk 'BEGIN{FS=OFS=","}
{key=$1 FS $2 FS $3 FS $4}
FNR==NR {cnt[key]++; next}
cnt[key]>1' file file

Or use sorted input if you want duplicate lines that are not adjacent in the file to be printed together:

awk 'BEGIN{FS=OFS=","}
{key=$1 FS $2 FS $3 FS $4}
FNR==NR {cnt[key]++; next}
cnt[key]>1' file <(sort -t, -s -k 1,4 file) 

Or if you are short on memory, use the Unix tools (which are more optimized for memory use) to pre-select duplicates and use awk to print them:

awk 'BEGIN{FS=OFS=","}
{key=$1 FS $2 FS $3 FS $4}
FNR==NR{seen[key]; next}
key in seen
' <(cut -d , -f 1-4 file | uniq -d) file

Alternatively, you can use grep with fixed strings to find the duplicates:

grep -F -f <(cut -d , -f 1-4 file | uniq -d ) file 

Any of those print:

Danna,20,female,City,80.00
Danna,20,female,City,90.00
Jason,22,male,City,90.00
Jason,22,male,City,80.00
dawg