
I have a file with duplicate records (duplicates are determined by a key column, the first column here). I want to keep only the last occurrence of each duplicate record in one file and move all the other duplicates to another file.

File : input

foo j
bar bn
bar b
bar bn
bar bn
bar bn
kkk hh
fjk ff
foo jj
xxx tt
kkk hh

I have used the following awk statement to keep the last occurrence --

awk '{line=$0; x[$1]=line;} END{ for (key in x) print x[key];}' input > output
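
For reference, this keeps the last occurrence because the assignment overwrites the stored line every time the key in column 1 repeats; here is the same statement with comments (the temporary line variable is dropped, logic unchanged):

awk '
    { x[$1] = $0 }                        # overwrite on every record, so the array holds the last line seen for each key
    END { for (key in x) print x[key] }   # print one line per key; the iteration order is unspecified
' input > output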

File : output

foo jj
xxx tt
fjk ff
kkk hh
bar bn

How can I move the repeating records to another file (leaving the last occurrence)?

For example, foo j would move into another file, say d_output, while foo jj stays in the output file.


3 Answers


A trick is to use tac to reverse the file first (it is easier to grab the first match than the last):

$ tac file | awk 'a[$1]++{print $0 > "dup";next}{print $0 > "output"}'

$ cat output
kkk hh
xxx tt
foo jj
fjk ff
bar bn

$ cat dup
kkk hh
bar bn
bar bn
bar b
bar bn
foo j
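
How it works: after tac reverses the file, the first time a key shows up is its last occurrence in the original order, so a[$1]++ is only non-zero for the earlier copies. The same pipeline with the two rules commented (logic unchanged):

tac file | awk '
    a[$1]++ { print $0 > "dup"; next }   # key already seen in the reversed stream: an earlier (duplicate) occurrence
    { print $0 > "output" }              # first sighting of the key = the last occurrence in the original file order
'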

Edit:

Here are the benchmark figures for the current 3 solutions over one million lines:

sudo_o

real    0m2.156s
user    0m1.004s
sys     0m0.117s

kent

real    0m2.806s
user    0m2.718s
sys     0m0.080s

scrutinizer

real    0m4.033s
user    0m3.939s
sys     0m0.082s

Verify here http://ideone.com/IBrNeh

On my local machine, using a file generated with seq 1 1000000 > bench:

# sudo_o
$ time tac bench | awk 'a[$1]++{print $0 > "dup";next}{print $0 > "output"}' 

real    0m0.729s
user    0m0.668s
sys     0m0.101s

# scrutinizer
$ time awk 'NR==FNR{A[$1]=NR; next} A[$1]!=FNR{print>f; next}1' f=dups bench bench > output

real    0m1.093s
user    0m1.016s
sys     0m0.070s

# kent 
$ time awk '$1 in a{print a[$1]>"dup.txt"}{a[$1]=$0}END{for(x in a)print a[x]}' bench > output

real    0m1.141s
user    0m1.055s
sys     0m0.080s
Chris Seymour
  • thanks Sudo_O. What will be the performance for large files like 1GB? – user2018441 Mar 16 '13 at 04:16
  • The awk is only looking at each line once so it should perform very well; Scrutinizer's answer has to pass the file twice so it's probably not the best for large files. – Chris Seymour Mar 16 '13 at 11:06
  • In this solution, the file is also read twice (tac once, and awk once). – William Pursell Mar 16 '13 at 13:12
  • @WilliamPursell yes but the `tac` is negligible. Scrutinizer's solution is ~2x slower than mine. Here are the benchmark figures http://ideone.com/IBrNeh – Chris Seymour Mar 16 '13 at 13:41
  • @Scrutinizer I have added the figures from my local machine using files and your solution performs on par with Kent's, but it is still twice as slow as mine. *F.Y.I* the use of `seq 1 1000000` and `/dev/null` is to get round the file permissions limitation of ideone. – Chris Seymour Mar 16 '13 at 15:06
  • OK, after the previous test on OSX, I also tested on an AIX box and with files my solution was again about twice as fast, and a little bit slower than the others when using the `seq` method, probably because it has to do that twice... I also noticed that your solution has more system time, perhaps because of the pipe. Did you use gawk? I used regular nawk and bwk, maybe that also has something to do with the contradictory results... – Scrutinizer Mar 16 '13 at 16:14
  • @sudo_O I just noticed that I downvoted your answer?! Sorry about that. I don't know how this happened! Maybe I checked this question out with my smartphone and touched the downvote arrow? No idea... sorry about that. I will make it right now. – Kent Mar 16 '13 at 23:09
  • @sudo_O I am not allowed to do it any longer... can you somehow edit your answer, whatever, so that I can cancel my downvote? thx. – Kent Mar 16 '13 at 23:11
  • @Kent I edited the answer so you can fix your vote. Thanks, I didn't think this answer deserved a downvote; glad it was only a mistake. – Chris Seymour Mar 16 '13 at 23:12

Tools like tac and rev are nice! However, they are not installed by default on all distributions, particularly since you have tagged the question with unix. Also, tac changes the order of output/dup.txt; if the original order should be kept, extra effort is needed to maintain it.

Try this line:

awk '$1 in a{print a[$1]>"dup.txt"}{a[$1]=$0}END{for(x in a)print a[x]}' file

with your example:

kent$  awk '$1 in a{print a[$1]>"dup.txt"}{a[$1]=$0}END{for(x in a)print a[x]}' file
foo jj
xxx tt
fjk ff
kkk hh
bar bn

kent$  cat dup.txt 
bar bn
bar b
bar bn
bar bn
foo j
kkk hh
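
The three rules commented, in case the flow is not obvious (same program, only spread over lines):

awk '
    $1 in a { print a[$1] > "dup.txt" }   # key seen before: the previously stored line is now known to be a duplicate
    { a[$1] = $0 }                        # always remember the most recent line for this key
    END { for (x in a) print a[x] }       # what remains in the array is the last occurrence of each key
' file
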
Kent

Another option you could try, keeping the order by reading the input file twice:

awk 'NR==FNR{A[$1]=NR; next} A[$1]!=FNR{print>f; next}1' f=dups file file

output:

bar bn
fjk ff
foo jj
xxx tt
kkk hh

Duplicates:

$ cat dups
foo j
bar bn
bar b
bar bn
bar bn
kkk hh
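
The two-file idiom spelled out with comments (same program; NR == FNR only holds while the first copy of the file is being read):

awk '
    NR == FNR    { A[$1] = NR; next }     # first pass: remember the line number of the last occurrence of each key
    A[$1] != FNR { print > f; next }      # second pass: not that remembered line, so send it to the duplicates file
    1                                     # otherwise print it (the kept last occurrence)
' f=dups file file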

@Sudo_O @WilliamPursell @user2018441: Sudo_O, thank you for the performance test. I tried to reproduce it on my system, but it does not have tac available, so I tested with Kent's version and mine; I could not reproduce those differences on my system.

Update: I tested with Sudo_O's version using cat instead of tac, although on a system with tac there was a difference of 0.2 seconds between tac and cat when outputting to /dev/null (see the bottom of this post).

I got:

Sudo_O
$ time cat <(seq 1 1000000) | awk 'a[$1]++{print $0 > "/dev/null";next}{print $0 > "/dev/null"}'

real    0m1.491s
user    0m1.307s
sys     0m0.415s

kent
$ time awk '$1 in a{print a[$1]>"/dev/null"}{a[$1]=$0}END{for(x in a)print a[x]}' <(seq 1 1000000) > /dev/null

real    0m1.238s
user    0m1.421s
sys     0m0.038s

scrutinizer
$ time awk 'NR==FNR{A[$1]=NR; next} A[$1]!=FNR{print>f; next}1' f=/dev/null <(seq 1 1000000) <(seq 1 1000000) > /dev/null

real    0m1.422s
user    0m1.778s
sys     0m0.078s

--

When using a file instead of the seq, I got:

Sudo_O
$ time cat <infile | awk 'a[$1]++{print $0 > "/dev/null";next}{print $0 > "/dev/null"}'

real    0m1.519s
user    0m1.148s
sys     0m0.372s


kent
$ time awk '$1 in a{print a[$1]>"/dev/null"}{a[$1]=$0}END{for(x in a)print a[x]}' <infile > /dev/null

real    0m1.267s
user    0m1.227s
sys     0m0.037s

scrutinizer
$ time awk 'NR==FNR{A[$1]=NR; next} A[$1]!=FNR{print>f; next}1' f=/dev/null <infile <infile > /dev/null

real    0m0.737s
user    0m0.707s
sys     0m0.025s

Probably due to caching effects, which would also be present for larger files. Creating the infile took:

$ time seq 1 1000000 > infile

real    0m0.224s
user    0m0.213s
sys     0m0.010s

Tested on a different system:

$ time cat <(seq 1 1000000) > /dev/null

real    0m0.764s
user    0m0.719s
sys     0m0.031s
$ time tac <(seq 1 1000000) > /dev/null

real    0m1.011s
user    0m0.820s
sys     0m0.082s
Scrutinizer