
I know I can use sort --unique to remove duplicate rows in a text file (or on the standard input). But what if I want to maintain the original order of the rows?

I know that if the duplicates happen to be consecutive, uniq does the trick; but in my case, the duplicates may be far apart from each other.

Also, I realize I could write a small program to do this in C, or perhaps in Python, but I would like to do it with bash. A naive solution would be to use a bash associative array as a set and add lines to it as they are seen (a sketch follows below), though I doubt this would scale very well.
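For reference, a minimal sketch of that naive pure-bash approach, assuming bash 4+ for associative arrays; the "$1" filename argument is just for illustration:

declare -A seen    # associative array used as a set (requires bash 4+)
while IFS= read -r line; do
    # note: empty lines may need special handling, since some bash
    # versions reject an empty array subscript
    if [[ -z ${seen[$line]+x} ]]; then
        printf '%s\n' "$line"    # first occurrence: print it
        seen[$line]=1            # ...and remember we have seen it
    fi
done < "$1"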

Just to illustrate:

original file:

one
two
five
two
two
four

after duplicate removal:

one
two
five
four


einpoklum

1 Answer

awk '!map[$0] { print } { map[$0]="1" }' file

Using awk, we create an array called map indexed by the whole line ($0). A line is printed only if there is no entry for it in the array yet; either way, we then record the line in the array so later occurrences are skipped.
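For what it's worth, the same idea is often condensed into the well-known one-liner below: print is awk's default action when the condition is true, and the post-increment makes the condition true only on a line's first occurrence:

awk '!seen[$0]++' file

Run against the sample input above, either command prints one, two, five and four, in that order.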

Raman Sailopal
  • This is a neat solution, in terms of terseness, but it requires saving a copy of the whole file/input stream in memory. That's pretty expensive. Anyway, +1. – einpoklum Jan 20 '21 at 18:14