
How can I print only those lines that appear exactly once in a file? E.g., given this file:

mountain
forest
mountain
eagle

The output would be this, because the line mountain appears twice:

forest
eagle
The lines can be sorted, if necessary.
Village
  • I think you can use a dictionary. You can have a look at this link: http://stackoverflow.com/questions/1494178/how-to-define-hash-tables-in-bash (a rough sketch of that idea follows these comments) –  May 19 '14 at 14:41
  • Does this answer your question? [Find unique lines](https://stackoverflow.com/questions/13778273/find-unique-lines) – Mad Physicist Oct 18 '21 at 13:03
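As suggested in the first comment above, the lookup can also be done by hand with a bash associative array. This is a rough sketch of that idea, not one of the posted answers: it requires bash 4+ and assumes the file (called inputfile here, as in the answers below) contains no empty lines.

#!/usr/bin/env bash
# Count how many times each line occurs, then print the lines seen exactly once.
declare -A count

while IFS= read -r line; do
    count[$line]=$(( ${count[$line]:-0} + 1 ))    # increment this line's counter
done < inputfile

for line in "${!count[@]}"; do                    # iterate over the distinct lines
    if [[ ${count[$line]} -eq 1 ]]; then          # keep only lines seen exactly once
        printf '%s\n' "$line"
    fi
done

As with the awk answer below, the output order of this sketch is not guaranteed.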

3 Answers


Use sort and uniq:

sort inputfile | uniq -u

The -u option would cause uniq to print only unique lines. Quoting from man uniq:

   -u, --unique
          only print unique lines

For your input, it'd produce:

eagle
forest

Note: remember to sort before running uniq -u, because uniq operates on adjacent lines. What uniq -u actually does is print lines that have no identical neighboring line, which by itself does not mean they are unique in the file. Sorting groups all identical lines together, so after uniq -u only the lines that are truly unique in the file remain.
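To see why the sort matters, compare the two pipelines on the question's sample file (assumed here to be saved as inputfile):

uniq -u inputfile          # no sort: the two mountain lines are not adjacent, so nothing is dropped
mountain
forest
mountain
eagle

sort inputfile | uniq -u   # sorting makes the duplicates adjacent, so they drop out
eagle
forest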

gmelodie
devnull

Using awk:

awk '{!seen[$0]++};END{for(i in seen) if(seen[i]==1)print i}' file
eagle
forest
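Spelled out as a multi-line script, the same logic reads as follows (equivalent to the one-liner above; the leading ! inside the action has no effect on the counting). Note that for (i in seen) visits the stored lines in no particular order, so unlike sort | uniq -u the output order is not guaranteed:

awk '
    { seen[$0]++ }                       # count how many times each line occurs
    END {
        for (line in seen)
            if (seen[line] == 1)         # print only lines that occurred exactly once
                print line
    }
' file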
anubhava
  • No need to go so complex. A simple `uniq` command will do the job as well. – Rahul May 19 '14 at 14:45
  • 1. It's not complex and 2. it avoids an expensive `sort` for larger files. – anubhava May 19 '14 at 14:46
  • @anubhava Nice awk. +1. But it really _is_ simpler to use `uniq`. And with larger files kept in memory, who knows which is more expensive: swapping or sorting. :) – clt60 May 19 '14 at 14:50
  • @anubhava just tested on 300k lines. This `awk` solution is 8 times faster than `sort|uniq`. – clt60 May 19 '14 at 14:55
  • @jm666: Thanks so much for running the test and verifying that the awk command is faster than `sort|uniq`. – anubhava May 19 '14 at 15:06
  • Since we are iterating, we can quickly check and print only those which are seen just once. `awk '{!seen[$0]++};END{for(i in seen) if(seen[i]==1)print i}' file` but +1 nonetheless. – jaypal singh May 19 '14 at 15:41
  • Yes, sure, that can also be done. I just chose delete to free up some memory; not sure how much that will help :) – anubhava May 19 '14 at 15:46
  • @anubhava That's a valid point, but as the solution is right now, it will probably get confused when the number of dups is odd. For example, if you add another `mountain` row, it will print it as well. – jaypal singh May 19 '14 at 17:02
  • @jaypal: Ah, that's a very important point. I updated as you suggested, many thanks! – anubhava May 19 '14 at 17:24
  • @anubhava Thanks for the edit and you're always welcome. `:)` – jaypal singh May 19 '14 at 17:26
  • @jm666 I tried with my `.xsession-errors.old` file (129315 lines), and the `sort | uniq` solution is 5 times _faster_ than this `awk` solution... – gniourf_gniourf May 19 '14 at 18:10
  • @gniourf_gniourf `sort` also has the added benefit of writing the cache to disk if memory is not available. `awk` does not have that benefit. – jaypal singh May 19 '14 at 18:12
  • I created a `803200 lines` text file. My awk command took `1.946s` whereas `sort|uniq` took `3.188s` on my OSX. – anubhava May 19 '14 at 18:32
  • my OS X is probably slow on IO, because I did: `gsort -uR /usr/share/dict/* > words.txt` (gsort is the GNU version of sort, used to get a randomly ordered file) and got 312123 lines. I then tested both commands: `time sort words.txt | uniq -u >/dev/null` (got 8.4 secs) and `time awk .... words.txt >/dev/null` (got 1.3 secs). So, for me (repeated a few times), the awk is nearly 8 times faster than sort. – clt60 May 19 '14 at 19:20
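If you want to run a comparison like the ones in these comments yourself, here is a rough sketch (GNU shuf and the /usr/share/dict/words word list are assumptions; the timings quoted above are the commenters' own measurements):

shuf /usr/share/dict/words > words.txt     # build a randomly ordered test file

time sort words.txt | uniq -u > /dev/null
time awk '{!seen[$0]++};END{for(i in seen) if(seen[i]==1)print i}' words.txt > /dev/null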

You almost had the answer in your question:

sort filename | uniq -u

Oliver Matthews