`awk` CLI that behaves like `uniq` without sort, but only catches consecutive dupes
Most other answers so far have given methods that remove duplicates even when they are not consecutive.
The problem with this is that it requires either sorting first or storing a potentially huge map in memory, which can be slow or unfeasible for large input files.
So for those cases, here's an `awk` solution that, like `uniq`, only catches duplicates if they appear on consecutive lines. E.g. to remove all consecutive duplicates on the first column we can use `$1` as in:

```
awk '$1 != last { print $0; last = $1; }' infile.txt
```
For example, considering the input file:

```
a 0
a 1
b 0
a 0
a 1
```

the output would be:

```
a 0
b 0
a 0
```
Here:

- the first `a 1` row was removed because the previous `a 0` row has the same first column `a`
- but we keep the second `a 0` row because the `b 0` row broke the run of consecutive duplicates
The `awk` script works simply by storing the value of the column from the previous line in the `last` variable and comparing the current line's value to it, printing the line only when they differ and skipping it when they are the same.
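For comparison, here is a minimal plain-Python sketch of the same consecutive-only logic (not part of the original one-liner; it assumes whitespace-separated columns, a 0-based key column as an optional first argument, and input on stdin):

```python
#!/usr/bin/env python3
# Sketch: print a line only when its key column differs from the previous line's.
import sys

col = int(sys.argv[1]) if len(sys.argv) > 1 else 0  # 0-based key column; 0 corresponds to awk's $1
last = None
for line in sys.stdin:
    fields = line.split()
    key = fields[col] if col < len(fields) else ''  # awk also yields "" for missing fields
    if key != last:
        sys.stdout.write(line)
    last = key  # remember the key of the line just seen, printed or not
```

Usage would be something like `python3 consecutive_uniq.py 0 < infile.txt` (the script name is hypothetical), which should match the `awk` output above.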
This consecutive-only approach can be useful if you know your input data contains a lot of useless consecutive dupes and you want to trim it down a bit before doing any more expensive sort-like processing.
The more robust solution if you really need to remove non-consecutive duplicates is generally to use a relational database like SQLite, e.g.: how can I delete duplicates in SQLite?
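As a rough sketch of that route, the standard library `sqlite3` module can do it without hand-rolling the deduplication; the table and column names below are made up for the illustration, which keeps the first occurrence of each first-column value regardless of how far apart the duplicates are:

```python
#!/usr/bin/env python3
# Illustrative sketch only: deduplicate on the first column via SQLite,
# keeping the first line seen for each key and preserving input order.
import sqlite3
import sys

conn = sqlite3.connect(':memory:')  # use a file path instead for inputs larger than RAM
conn.execute('CREATE TABLE t (k TEXT PRIMARY KEY, line TEXT)')
for line in sys.stdin:
    fields = line.split()
    if not fields:
        continue  # skip blank lines
    # INSERT OR IGNORE drops any later line whose key already exists.
    conn.execute('INSERT OR IGNORE INTO t (k, line) VALUES (?, ?)', (fields[0], line))
for (line,) in conn.execute('SELECT line FROM t ORDER BY rowid'):
    sys.stdout.write(line)
```

The point of going through SQLite rather than an in-memory dict is that you can switch `:memory:` to an on-disk database when the set of keys itself no longer fits in RAM.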
Quick Python script to remove duplicates that appear in the last N lines
If you need a bit more flexibility but still don't want to pay for the full sort:
`uniqn`

```python
#!/usr/bin/env python

import argparse
from argparse import RawTextHelpFormatter
import fileinput

parser = argparse.ArgumentParser(
    description='uniq but with a memory of the n previous distinct lines rather than just one',
    epilog="""Useful if you know that duplicate lines in an input file are nearby to one another, but not necessarily immediately one after the other.
This command was about 3x slower than uniq, and becomes highly CPU (?) bound even on rotating disks. We need to make a C++ version one day, or try PyPy/Cython""",
    formatter_class=RawTextHelpFormatter,
)
parser.add_argument("-k", default=None, type=int)   # 0-based column to compare; whole line if omitted
parser.add_argument("-n", default=10, type=int)     # how many previous distinct keys to remember
parser.add_argument("file", nargs='?', default=[])  # read stdin if no file is given
args = parser.parse_args()
k = args.k

# Dict used as an ordered set of the last args.n distinct keys seen
# (dicts preserve insertion order since Python 3.7).
lastlines = {}
for line in fileinput.input(args.file):
    line = line.rstrip('\r\n')
    orig = line
    if k is not None:
        line = line.split()[k]
    if line not in lastlines:
        print(orig)
    # Re-insert the key so it becomes the most recently seen one.
    lastlines.pop(line, None)
    lastlines[line] = True
    # Forget the oldest key once more than args.n are remembered.
    if len(lastlines) == args.n + 1:
        del lastlines[next(iter(lastlines))]
```
This script looks for duplicates within the previous `-n` distinct lines, and can be useful for cleaning data that has some kind of periodic pattern which prevents `uniq` from doing much to it. `-k` selects the column to compare. E.g. consider the input file:
`uniqn-test`

```
1 a
2 a
3 a
1 a
2 a
2 b
3 a
```
Then:

```
./uniqn -k0 -n3 uniqn-test
```

gives:

```
1 a
2 a
3 a
```
E.g. the second `1 a` sees the first `1 a` three lines back within the `-n3` window, and is therefore skipped.
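If the dict juggling at the end of the script looks opaque, the same bounded memory of recent distinct keys can be written more explicitly with `collections.OrderedDict`; this is just an illustrative rewrite of the idea, not the script itself:

```python
from collections import OrderedDict

def uniq_recent(lines, n=10, key=lambda line: line):
    """Yield each line whose key was not among the last n distinct keys seen."""
    recent = OrderedDict()  # oldest key first, newest last
    for line in lines:
        k = key(line)
        if k not in recent:
            yield line
        recent[k] = True
        recent.move_to_end(k)           # this key is now the most recently seen
        if len(recent) > n:
            recent.popitem(last=False)  # forget the oldest remembered key

# Reproduces the uniqn-test example above:
data = ["1 a", "2 a", "3 a", "1 a", "2 a", "2 b", "3 a"]
print(list(uniq_recent(data, n=3, key=lambda line: line.split()[0])))
# -> ['1 a', '2 a', '3 a']
```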
Some built-in `uniq` options to consider

Although `uniq` doesn't have a nice "consider only the N-th column" option, it does have some flags that might solve certain more restricted cases, from `man uniq`:
- `-f, --skip-fields=N`: avoid comparing the first N fields
- `-s, --skip-chars=N`: avoid comparing the first N characters
- `-w, --check-chars=N`: compare no more than N characters in lines
A field is a run of blanks (usually spaces and/or TABs), then non-blank characters. Fields are skipped before chars.
If only someone would patch a `--check-fields` option analogous to `--check-chars` into it, then we'd be done with `--skip-fields N-1 --check-fields 1`. As things stand, the existing flags already cover the special case where the column of interest is the last field, since `--skip-fields N-1` compares everything from field N to the end of the line.
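For example, reusing the sample input from the top, `uniq -f 1 infile.txt` skips the first field and compares the rest of each line, effectively deduplicating consecutive rows on the second (and last) column, giving:

```
a 0
a 1
b 0
a 1
```

Only the fourth input row is removed, because it repeats the `0` of the preceding `b 0` row.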
Tested on Ubuntu 23.04.