
I have a ksh script that returns a long list of values, newline separated, and I want to see only the unique/distinct values. Is it possible to do this?

For example, say my output is file suffixes in a directory:

tar
gz
java
gz
java
tar
class
class

I want to see a list like:

tar
gz
java
class
brabster

8 Answers


You might want to look at the uniq and sort applications.

./yourscript.ksh | sort | uniq

(FYI, yes, the sort is necessary in this command line; uniq only strips duplicate lines that immediately follow each other.)
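A quick illustration of that point (printf merely stands in for your script's output): a duplicate that is not adjacent survives plain uniq, but not sort | uniq:

printf 'gz\ntar\ngz\n' | uniq          # prints gz, tar, gz
printf 'gz\ntar\ngz\n' | sort | uniq   # prints gz, tar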

EDIT:

Contrary to what has been posted by Aaron Digulla in relation to uniq's commandline options:

Given the following input:

class
jar
jar
jar
bin
bin
java

uniq will output all lines exactly once:

class
jar
bin
java

uniq -d will output all lines that appear more than once, and it will print them once:

jar
bin

uniq -u will output all lines that appear exactly once, and it will print them once:

class
java
Matthew Scharley
  • Just an FYI for latecomers: @AaronDigulla's answer has since been corrected. – mklement0 Jan 18 '14 at 07:16
  • very good point this ` sort is necessary in this command line, uniq only strips duplicate lines that are immediately after each other` which I have just learnt!! – HattrickNZ Apr 15 '15 at 20:15
  • GNU `sort` features a `-u` version for giving the unique values too. – Mingye Wang Dec 09 '15 at 05:05
  • I figured out that `uniq` seems to process only adjacent lines (at least by default), meaning one may `sort` input before feeding `uniq`. – Stphane Feb 19 '16 at 00:28
  • I did some testing on 400MB of data - `sort | uniq` was 95 seconds - `sort -u` was 77 - `awk '!a[$0]++'` from @ajak6 was 9 seconds. So awk wins but also the hardest to remember. – MikeKulls Aug 27 '21 at 00:26
./script.sh | sort -u

This is the same as monoxide's answer, but a bit more concise.
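Applied to the question's example (file suffixes in a directory), a sketch along these lines should work, assuming filenames contain no newlines; the sed call strips everything up to the last dot, and only names that actually contain a dot are printed:

ls | sed -n 's/.*\.//p' | sort -u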

gpojd
  • You're being modest: your solution will also _perform_ better (probably only noticeable with large data sets). – mklement0 Jan 18 '14 at 07:20
  • I think that should be more efficient than `... | sort | uniq` because it is performed in one shot – Adrian Antunez Aug 06 '18 at 14:43
  • @AdrianAntunez maybe it's also because the `sort -u` doesn't need to update the sorted list each time it finds an already encountered earlier value. while the `sort |` has to sort _all_ items before it passes it to `uniq` – whyer Nov 10 '20 at 15:02
  • @mklement0 @AdrianAntunez At the first time I thought `sort -u` could be faster because any optimal comparison sort algorithm has `O(n*log(n))` complexity, but it is possible to find all unique values with `O(n)` complexity using Hash Set data structure. Nonetheless, both `sort -u` and `sort | uniq` have almost the same performance and they both are slow. I have conducted some tests on my system, more info at https://gist.github.com/sda97ghb/690c227eb9a6b7fb9047913bfe0e431d – Divano Dec 01 '21 at 12:11
  • Thanks! Your solution worked for me, while `./script.sh | sort | uniq -u` didn't output anything. Maybe because the output was too large? Although it wasn't so big, the output had 50_000 lines, with just 4 distinct values. – Ferran Maylinch Jan 31 '23 at 16:12

With zsh you can do this:

% cat infile 
tar
more than one word
gz
java
gz
java
tar
class
class
zsh-5.0.0[t]% print -l "${(fu)$(<infile)}"
tar
more than one word
gz
java
class
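For reference, as I understand the parameter-expansion flags used here: (f) splits the result of $(<infile) on newlines, (u) keeps only the first occurrence of each element, and print -l prints one element per line. Annotated, the same command reads:

# (f) split on newlines, (u) drop repeated elements, print -l: one per line
print -l "${(fu)$(<infile)}"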

Or you can use AWK:

% awk '!_[$0]++' infile    
tar
more than one word
gz
java
class
Dimitre Radoulov
  • Clever solutions that do not involve sorting the input. Caveats: The very-clever-but-cryptic `awk` solution (see http://stackoverflow.com/a/21200722/45375 for an explanation) will work with large files as long as the number of unique lines is small enough (as unique lines are kept in memory). The `zsh` solution reads the entire file into memory first, which may not be an option with large files. Also, as written, only lines with no embedded spaces are handled correctly; to fix this, use `IFS=$'\n' read -d '' -r -A u – mklement0 Jan 18 '14 at 08:18
  • Correct. Or: `(IFS=$'\n' u=($( – Dimitre Radoulov Jan 18 '14 at 16:42
  • Thanks, that's simpler (assuming you don't need to set variables needed outside the subshell). I'm curious as to when you need the `[@]` suffix to reference all elements of an array - seems that - at least as of version 5 - it works without it; or did you just add it for clarity? – mklement0 Jan 18 '14 at 17:08
  • @mklement0, you're right! I didn't think of it when I wrote the post. Actually, this should be sufficient: `print -l "${(fu)$( – Dimitre Radoulov Jan 18 '14 at 17:17
  • Fantastic, thanks for updating your post - I took the liberty of fixing the `awk` sample output, too. – mklement0 Jan 18 '14 at 17:30

With AWK you can do:

 ./yourscript.ksh | awk '!a[$0]++'

I find it faster than sort and uniq.
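In case the one-liner looks cryptic: awk keeps a count of how many times each line has been seen, and a line is printed only while that count is still zero. A spelled-out equivalent (the array name `seen` is arbitrary):

./yourscript.ksh | awk '{ if (seen[$0]++ == 0) print }'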

Ajak6
  • That's definitely my favorite way to do the job, thanks a lot! Especially for larger files, the sort|uniq-solutions are probably not what you want. – Schmitzi Sep 30 '19 at 13:04
  • I did some testing and this was 10 times faster than other solutions, but also 10x harder to remember :-) – MikeKulls Aug 27 '21 at 00:31
  • Yeah, I'm not quite sure what awk is doing here. But thanks for the solution!! – Barbituate Sep 17 '21 at 16:05

Pipe them through sort and uniq. This removes all duplicates.

uniq -d gives only the duplicates, uniq -u gives only the unique ones (strips duplicates).

Aaron Digulla

For larger data sets where sorting may not be desirable, you can also use the following perl script:

./yourscript.ksh | perl -ne 'if (!defined $x{$_}) { print $_; $x{$_} = 1; }'

This basically just remembers every line it has output so that it doesn't output it again.

It has the advantage over the "sort | uniq" solution in that there's no sorting required up front.
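The same idea is often written as a shorter Perl idiom; this sketch should behave identically to the script above:

./yourscript.ksh | perl -ne 'print unless $seen{$_}++'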

paxdiablo
  • Note that sorting of a very large file is not an issue per se with sort; it can sort files which are larger than the available RAM+swap. Perl, OTOH, will fail if there are only few duplicates. – Aaron Digulla Mar 06 '09 at 11:06
  • Yes, it's a trade-off depending on the expected data. Perl is better for huge dataset with many duplicates (no disk-based storage required). Huge dataset with few duplicates should use sort (and disk storage). Small datasets can use either. Personally, I'd try Perl first, switch to sort if it fails. – paxdiablo Mar 06 '09 at 11:33
  • Since sort only gives you a benefit if it has to swap to disk. – paxdiablo Mar 06 '09 at 11:34
  • This is great when I want the first occurrence of every line. Sorting would break that. – Bluu May 10 '12 at 19:30
  • Ultimately perl will be sorting the entries in some form to put into its dictionary (or whatever it is called in perl), so you can't actually avoid the processing time of a sort. – MikeKulls Aug 27 '21 at 00:30
  • e.g. `tail -F -n+1 urls.txt | perl -ne 'if (!defined $x{$_}) { print $_; $x{$_} = 1; }' | while read -r url; ...` this version works when you need to stream to another pipe immediately. +1 – Phillmac Nov 30 '21 at 01:23

Unique, as requested (but not sorted);
uses fewer system resources for fewer than ~70 elements (as tested with time);
written to take input from stdin
(or modify it to include in another script).
(Bash)

bag2set () {
    # Reduce a_bag (which may contain duplicates) to a_set (unique values only).
    local -i i j n=${#a_bag[@]}
    for ((i=0; i < n; i++)); do
        if [[ -n ${a_bag[i]} ]]; then
            a_set[i]=${a_bag[i]}
            a_bag[i]=$'\0'
            # Blank out every later occurrence of the value just kept
            # (quote the right-hand side so it is compared literally, not as a glob).
            for ((j=i+1; j < n; j++)); do
                [[ ${a_set[i]} == "${a_bag[j]}" ]] && a_bag[j]=$'\0'
            done
        fi
    done
}
declare -a a_bag=() a_set=()
stdin="$(</dev/stdin)"
declare -i i=0
# Word-split stdin into the bag (assumes the values contain no whitespace).
for e in $stdin; do
    a_bag[i]=$e
    i=$i+1
done
bag2set
echo "${a_set[@]}"
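Usage sketch, assuming the above is saved as uniq_unsorted.sh (a hypothetical file name):

./yourscript.ksh | bash uniq_unsorted.sh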
FGrose

A better tip for getting non-duplicate entries in a file:

awk '$0 != x ":FOO" && NR>1 {print x} {x=$0} END {print}' file_name | uniq -f1 -u