
I know I can use sort --unique to remove duplicate rows in a text file (or on the standard input). But what if I want to maintain the original order of the rows?

I know that if the duplicates happen to be consecutive, uniq does the trick; but in my case, the duplicates may be far apart from each other.

Also, I realize I could write a small program to do this in C, or perhaps in Python, but I would like to do it with bash. A naive solution would be to use a bash associative array as a set and add lines to it as they are seen (a sketch follows below), though I doubt this would scale very well.
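For reference, a minimal sketch of that naive pure-bash approach, assuming bash 4+ for associative arrays; the "$1" filename argument is just for illustration:

declare -A seen    # associative array used as a set (requires bash 4+)
while IFS= read -r line; do
    # note: empty lines may need special handling, since some bash
    # versions reject an empty array subscript
    if [[ -z ${seen[$line]+x} ]]; then
        printf '%s\n' "$line"    # first occurrence: print it
        seen[$line]=1            # ...and remember we have seen it
    fi
done < "$1"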

Just to illustrate:

original file:

one
two
five
two
two
four

after duplicate removal:

one
two
five
four


einpoklum

1 Answer

awk '!map[$0] { print } { map[$0]="1" }' file

Using awk, we create an array called map indexed by the whole line ($0). A line is printed only if there is no entry for it in the array yet; either way, we then record the line in the array so later occurrences are skipped.
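For what it's worth, the same idea is often condensed into the well-known one-liner below: print is awk's default action when the condition is true, and the post-increment makes the condition true only on a line's first occurrence:

awk '!seen[$0]++' file

Run against the sample input above, either command prints one, two, five and four, in that order.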

Raman Sailopal
  • This is a neat solution, in terms of terseness, but it requires saving a copy of the whole file/input stream in memory. That's pretty expensive. Anyway, +1. – einpoklum Jan 20 '21 at 18:14