241

I have a .csv file like this:

stack2@domain.example,2009-11-27 01:05:47.893000000,domain.example,127.0.0.1
overflow@domain2.example,2009-11-27 00:58:29.793000000,domain2.example,255.255.255.0
overflow@domain2.example,2009-11-27 00:58:29.646465785,domain2.example,256.255.255.0
...

I have to remove lines with a duplicate e-mail (the entire line) from the file (i.e. one of the lines containing overflow@domain2.example in the above example). How do I use uniq on only field 1 (the fields are separated by commas)? According to the man page, uniq doesn't have an option for columns.

I tried something with sort | uniq but it doesn't work.

Stephen Ostermiller
Eno

9 Answers

404
sort -u -t, -k1,1 file
  • -u for unique
  • -t, so comma is the delimiter
  • -k1,1 for the key field 1

Test result:

overflow@domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
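
Note that the stop position matters: -k1 on its own defines a key that runs from field 1 to the end of the line, so lines with the same e-mail but different timestamps would not count as duplicates. A minimal illustration (assuming GNU sort):

sort -u -t, -k1,1 file   # key is field 1 only: one line per e-mail address
sort -u -t, -k1 file     # key runs to the end of the line: both overflow lines survive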
Stephen Ostermiller
Carl Smotricz
  • 23
    why do you need the ,1 in -k1,1? why not just -k1? – hello_there_andy Nov 24 '14 at 20:10
  • 26
    @hello_there_andy: This is explained in the manual (`man sort`). It stands for the start and stop position. – Serrano Jan 27 '15 at 13:21
  • How does it decide which line with a duplicate field to output? Is it the first occurrence of the duplicate before sorting? – Geremia Apr 13 '16 at 20:15
  • 6
    @CarlSmotricz: I tested it and it confirmed what `sort`'s manpage says: "`-u`, `--unique` with `-c`, check for strict ordering; without `-c`, **output only the first of an equal run**." So, it is indeed "the first occurrence of the duplicate before sorting." – Geremia Apr 15 '16 at 17:32
  • 2
    this changes the order of the lines as well, doesn't it? – rkachach Mar 20 '19 at 10:33
  • 3
    It does answer the specific question, but the title doesn't reflect that - ie there are other options to 'uniq' that 'sort -u' doesn't apply to - eg simply reporting which lines are duplicated (and not produce output for lines that are unique). I wonder why 'uniq' has a '--skip-fields=N' option, but does not have an option to select which field to compare...it seems like an obvious thing to have. – Max Waterman Nov 21 '19 at 07:53
132
awk -F"," '!_[$1]++' file
  • -F sets the field separator.
  • $1 is the first field.
  • _[val] looks up val in the hash _ (a regular variable).
  • ++ increments it and returns the old value.
  • ! negates the result (logical NOT), so the expression is true only the first time a given first field is seen.
  • there is an implicit print at the end.
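
For example, on the sample data from the question this keeps the first line seen for each address and preserves the input order (a quick sketch, with file standing in for your .csv):

awk -F"," '!_[$1]++' file

stack2@domain.example,2009-11-27 01:05:47.893000000,domain.example,127.0.0.1
overflow@domain2.example,2009-11-27 00:58:29.793000000,domain2.example,255.255.255.0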
Pepijn
ghostdog74
  • 6
    This approach is two times faster than sort – Alex Bitek Feb 17 '15 at 21:12
  • 12
    This also has the additional benefit of keeping the lines in the original order! – AffluentOwl Mar 10 '15 at 00:21
  • 9
    If you need the *last* uniq instead of the first then this awk script will help: `awk -F',' '{ x[$1]=$0 } END { for (i in x) print x[i] }' file` – Sukima Oct 01 '15 at 17:36
  • 4
    @eshwar just add more fields to the dictionary index! For instance, `!_[$1][$2]++` can be used to sort by the first two fields. My `awk`-fu isn't strong enough to be able to unique on a range of fields, though. :( – Soham Chowdhury Jan 29 '17 at 06:43
  • Note: you want to use this solution if you are reading from stdin: `| awk -F"," '!_[$1]++' -` – marek094 Jul 01 '18 at 15:27
  • 1
    Brilliant! this option is better than the answer because it keeps the lines order – rkachach Mar 20 '19 at 11:55
  • 1
    This solution has pros but also cons : it keeps the first column in memory for each line : could be greedier than sort -u for big files. – Corentin Limier Jul 02 '19 at 16:18
  • If you are seeing _[: Event not found error, escape the ! with \ – Plutonium smuggler Aug 01 '22 at 08:33
  • If you want multiple fields, you can use `cat my-file.tsv | awk -F $'\t' '!_[$3,$19]++'` where I'm using fields 3 and 19 to select uniques to deduplicate. Note I'm using tab delimited values instead of comma. – James T. Aug 03 '22 at 19:07
23

To consider multiple columns:

Sort and give a unique list based on column 1 and column 3:

sort -u -t : -k 1,1 -k 3,3 test.txt
  • -t : colon is separator
  • -k 1,1 -k 3,3 based on column 1 and column 3
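
The same idea works for the comma-separated data from the question; for example, to deduplicate on both the e-mail (column 1) and the IP address (column 4), something like this should do (with file standing in for your .csv):

sort -u -t, -k1,1 -k4,4 file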
Nicolás Ozimica
Prakash
8

If you want to use uniq:

<mycvs.cvs tr -s ',' ' ' | awk '{print $3" "$2" "$1}' | uniq -c -f2

gives:

1 01:05:47.893000000 2009-11-27 stack2@domain.example
2 00:58:29.793000000 2009-11-27 overflow@domain2.example
Stephen Ostermiller
Carsten C.
  • 5
    I'd like to point out a possible simplification: You can dump the `cat`! Rather than piping into tr, just let tr read the file using `<`. Piping through `cat` is a common unnecessary complication used by novices. For large amounts of data there's a performance effect to be had. – Carl Smotricz Dec 16 '09 at 16:27
  • The reversing of fields can be simplified with `rev`. – Hielke Walinga Jul 09 '19 at 15:02
  • @HielkeWalinga I thought `rev` reverses the **characters** in each line, not fields?! – Fixee Dec 18 '21 at 20:55
  • 1
    @Fixee Yes but in that way also the order of the fields, and it doesn't matter for the uniqueness the fields that the characters are reversed. So like this: ` – Hielke Walinga Dec 19 '21 at 22:04
7

If you want to retain the last of the duplicates instead of the first, you could use:

tac a.csv | sort -u -t, -r -k1,1 | tac

which was my requirement here.

tac reverses the file line by line.
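
For example, with a.csv holding the sample lines from the question, this keeps the later overflow@domain2.example line rather than the first one (a sketch assuming GNU sort, whose -u keeps the first line of each equal run; note the output comes back in sorted order, not the original order):

tac a.csv | sort -u -t, -r -k1,1 | tac

overflow@domain2.example,2009-11-27 00:58:29.646465785,domain2.example,256.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,domain.example,127.0.0.1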

Sumukh
2

Here is a very nifty way.

First format the content such that the column to be compared for uniqueness is a fixed width. One way of doing this is to use awk printf with a field/column width specifier ("%15s").

Now the -f and -w options of uniq can be used to skip preceding fields/columns and to specify the comparison width (column(s) width).

Here are three examples.

In the first example...

1) Temporarily make the column of interest a fixed width greater than or equal to the field's max width.

2) Use the -f uniq option to skip the prior columns, and use the -w uniq option to limit the comparison to the tmp_fixed_width.

3) Remove trailing spaces from the column to "restore" its width (assuming there were no trailing spaces beforehand).

printf "%s" "$str" \
| awk '{ tmp_fixed_width=15; uniq_col=8; w=tmp_fixed_width-length($uniq_col); for (i=0;i<w;i++) { $uniq_col=$uniq_col" "}; printf "%s\n", $0 }' \
| uniq -f 7 -w 15 \
| awk '{ uniq_col=8; gsub(/ */, "", $uniq_col); printf "%s\n", $0 }'

In the second example...

Create a new uniq column 1. Then remove it after the uniq filter has been applied.

printf "%s" "$str" \
| awk '{ uniq_col_1=4; printf "%15s %s\n", uniq_col_1, $0 }' \
| uniq -f 0 -w 15 \
| awk '{ $1=""; gsub(/^ */, "", $0); printf "%s\n", $0 }'

The third example is the same as the second, but for multiple columns.

printf "%s" "$str" \
| awk '{ uniq_col_1=4; uniq_col_2=8; printf "%5s %15s %s\n", uniq_col_1, uniq_col_2, $0 }' \
| uniq -f 0 -w 5 \
| uniq -f 1 -w 15 \
| awk '{ $1=$2=""; gsub(/^ */, "", $0); printf "%s\n", $0 }'
NOYB
1

awk CLI that behaves like uniq without sort but only catches consecutive dupes

Most other answers so far have given methods that remove duplicates even when they are not consecutive.

The problem with this is that it requires either sorting first or storing a potentially huge map in memory, which could be slow or infeasible for large input files.

So for those cases, here's an awk solution that, like uniq, only catches duplicates if they appear on consecutive lines. E.g. to remove all consecutive duplicates on the first column we can use $1 as in:

awk '$1 != last { print $0; last = $1; }' infile.txt

For example, considering the input file:

a 0
a 1
b 0
a 0
a 1

the output would be:

a 0
b 0
a 0

Here:

  • the first a 1 line was removed because the previous a 0 line has the same first column a
  • but we get a second a 0 line because the b 0 line broke the run of consecutive duplicates

The awk script works by storing the value of the column from the previous line in the last variable and comparing the current value to it, printing the line only when they differ.

This consecutive-only approach can be useful if you know your input data has a lot of useless consecutive dupes, and want to clean that down a bit before doing any more expensive sort-like processing.

The more robust solution if you really need to remove non-consecutive duplicates is generally to use a relational database like SQLite, e.g.: how can I delete duplicates in SQLite?
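
A minimal sketch of that route using the sqlite3 command-line shell (the table and column names below are just placeholders), keeping the first row seen for each value of the first column and printing the result back out as CSV:

sqlite3 dedup.db <<'EOF'
CREATE TABLE t(email TEXT, ts TEXT, host TEXT, ip TEXT);
.mode csv
.import infile.csv t
DELETE FROM t WHERE rowid NOT IN (SELECT MIN(rowid) FROM t GROUP BY email);
SELECT * FROM t;
EOF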

Quick Python script to remove duplicates that appear within the last N lines

If you need a bit more flexibility but still don't want to pay for the full sort:

uniqn

#!/usr/bin/env python

import argparse
from argparse import RawTextHelpFormatter
import fileinput
import sys

parser = argparse.ArgumentParser(
    description='uniq but with a memory of the n previous distinct lines rather than just one',
    epilog="""Useful if you know that duplicate lines in an input file are nearby to one another, but not necessarily immediately one afte the other.

This command was about 3x slower than uniq, and becomes highly CPU (?) bound even on rotating disks. We need to make a C++ version one day, or try PyPy/Cython""",
    formatter_class=RawTextHelpFormatter,
)
parser.add_argument("-k", default=None, type=int)
parser.add_argument("-n", default=10, type=int)
parser.add_argument("file", nargs='?', default=[])
args = parser.parse_args()
k = args.k

lastlines = {}
for line in fileinput.input(args.file):
    line = line.rstrip('\r\n')
    if k is not None:
        # Deduplicate on column k only, but remember the full line for printing.
        orig = line
        line = line.split()[k]
    else:
        orig = line
    if line not in lastlines:
        print(orig)
    # Move the key to the end of the insertion-ordered dict (Python 3.7+).
    lastlines.pop(line, None)
    lastlines[line] = True
    # Forget the oldest key once more than n distinct keys are being tracked.
    if len(lastlines) == args.n + 1:
        del lastlines[next(iter(lastlines))]

This script looks for duplicates on the previous -n lines, and can be useful to clean data that has some kind of periodic pattern preventing uniq from doing much to it. -k selects the column. e.g. consider the input file:

uniqn-test

1 a
2 a
3 a
1 a
2 a
2 b
3 a

Then:

./uniqn -k0 -n3 uniqn-test

gives:

1 a
2 a
3 a

E.g. the second 1 a sees the first 1 a three lines back and skips it as a result of -n3.

Some built-in uniq options to consider

Although uniq doesn't have a nice "consider only the N-th column" option, it does have some flags that might solve certain more restricted cases; from man uniq:

-f, --skip-fields=N: avoid comparing the first N fields

-s, --skip-chars=N: avoid comparing the first N characters

-w, --check-chars=N: compare no more than N characters in lines

A field is a run of blanks (usually spaces and/or TABs), then non-blank characters. Fields are skipped before chars.

If only someone would patch a --check-fields option analogous to --check-chars into it, we could simply use --skip-fields N-1 --check-fields 1. It does already work for the specific case of the first field, however.
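
For example, on the infile.txt sample from earlier in this answer, whose first column is a single character wide, -w alone already reproduces the consecutive-duplicate behaviour of the awk one-liner (a sketch assuming GNU uniq):

uniq -w 1 infile.txt

a 0
b 0
a 0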

Tested on Ubuntu 23.04.

Ciro Santilli OurBigBook.com
  • what performance do u need ? `out9: 1.85GiB 0:00:15 [ 125MiB/s] [ 125MiB/s] [ <=> ] ( pvE 0.1 in0 < "$m3t" | mawk2 '!__[$_]++' FS='\n'; ) 11.70s user 1.53s system 87% cpu 15.158 total rows = 12494275. | UTF8 chars = 1285316715. | bytes = 1983544693.` `awk` full-row de-duped `12.5 million` rows spanning `1.85 GiB` in `15.2 seconds`. ps : turns out it's my custom `wc` that slowed it down. without it it's 12.35 seconds to de-dupe nearly `1.9 GiB` of Unicode multi-byte text – RARE Kpop Manifesto Jul 03 '23 at 21:38
  • Tried de-duping a `148,156,631 rows` `7.59 GiB` `ASCII` text file - then finally `sort | uniq` came out ahead of `awk` by `4.5 %` - `148 secs v. 155` – RARE Kpop Manifesto Jul 03 '23 at 21:47
  • @RAREKpopManifesto and it is easy to go way above 7 GiB. Sure for small files, `awk` map is fine. But when we get into 100s of GBs, it will blow up memory, and sort (which uses temporary files to not blow up memory) will likely be slower than `awk` without map. – Ciro Santilli OurBigBook.com Jul 04 '23 at 07:01
  • that's `7 GB` of full row de-duping. De-duping by reasonably sized keys, `awk` can easily handle `100 GiB` without wasting time sorting. You can always pre-pipe it via `python` and do make `awk` simply do full-row de-duping via something like `SHA256` hash of the row. The greatest sorts are still `n log n`, give or take, so I always `sort` ***after*** filtering, not before, whenever applicable. – RARE Kpop Manifesto Jul 04 '23 at 08:55
  • most surprisingly, at least in my `7 GiB` file, pre-piping it to `python3` to do row-level `MD5` ended up slowing it down by `9 secs` – RARE Kpop Manifesto Jul 04 '23 at 11:16
-2

By sorting the file with sort first, you can then apply uniq.

It seems to sort the file just fine:

$ cat test.csv
overflow@domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
overflow@domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1

$ sort test.csv
overflow@domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
overflow@domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1

$ sort test.csv | uniq
overflow@domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
overflow@domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1

You could also do some AWK magic:

$ awk -F, '{ lines[$1] = $0 } END { for (l in lines) print lines[l] }' test.csv
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
overflow@domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
Stephen Ostermiller
Mikael S
  • This isn't unique *by column* as asked for in the question. This is just unique for the entire line. Also, you don't have to do a sort to do a uniq. The two are mutually exclusive. – Javid Jamae Sep 24 '14 at 19:47
  • 2
    Yes, you are right. The last example does what the question asked for though, even though the accepted answer is a lot cleaner. Regarding `sort`, then `uniq`, `sort` needs to be done before doing `uniq` otherwise it doesn't work (but you can skip the second command and just use `sort -u`). From [`uniq(1)`](http://linux.die.net/man/1/uniq): "Filter adjacent matching lines from INPUT (or standard input), writing to OUTPUT (or standard output)." – Mikael S Sep 25 '14 at 06:13
-3

Well, simpler than isolating the column with awk: if you need to remove every line with a certain value from a given file, why not just use grep -v?

e.g. to delete everything with the value "col2" in the second position of a line (col1,col2,col3,col4):

grep -v ',col2,' file > file_minus_offending_lines

If this isn't good enough, because some lines may get improperly stripped by possibly having the matching value show up in a different column, you can do something like this:

Use awk to isolate the offending column, e.g.:

awk -F, '{print $2 "|" $0}'

The -F sets the field delimiter to ",", $2 means column 2, followed by some custom delimiter and then the entire line ($0). You can then filter by removing lines that begin with the offending value:

awk -F, '{print $2 "|" $0}' | grep -v ^BAD_VALUE

and then strip out the stuff before the delimiter:

awk -F, '{print $2 "|" $0}' | grep -v ^BAD_VALUE | sed 's/.*|//g'

(Note: the sed command is sloppy because it doesn't handle escaping of values. Also, the sed pattern should really be something like "[^|]+" (i.e. anything that is not the delimiter). But hopefully this is clear enough.)

Steve B.
  • 3
    He does not want to purge lines, he wants to retain a single copy of a line with a specific string. Uniq is the right use case. – ingyhere Nov 13 '15 at 01:34