Why can't I print unique lines from a file using bash?

Question

I have the following straightforward file:

ANE 
ANE
AHL 
AHL
ANI 
ANI
ANJ 
ANJ
ANK 
ANK
ANL 
ANL
ANM 
ANM
ANN 
ANN
ANO 
ANO
ANP 
ANP
ANQ 
ANQ
ANR 
ANR
AMY 
AMY
AMZ 
AMZ

I would like to be able to get this file:

ANE
AHL
ANI
ANJ
ANK
ANL
ANM
ANN
ANO
ANP
ANQ
ANR
AMY
AMZ

I have tried several iterations of awk: awk '{!seen[$0]++};END{for(i in seen) if(seen[i]==1)print i}' file.txt

and several iterations of sort/uniq like: sort file.txt | uniq -u

and somehow I STILL keep getting all the duplicates in the printout. Please help me understand what I'm doing wrong.

There is one trailing space in every second row. Is that intentional? — Cyrus, Apr 21 '23 at 17:43
@Robert, I've tried just uniq ...still getting duplicates. example `uniq file.txt` ...some of the output: AHL AHL ANI ANI — user3452868, Apr 21 '23 at 17:50
do all lines have just a single string? if some lines have multiple strings then please update the question to show some of these lines — markp-fuso, Apr 21 '23 at 17:51
removed trailing spaces after each 3-letter word in the "results im looking for file", sorry was not intentional — user3452868, Apr 21 '23 at 17:52
no, no multiple strings, every line is just a 3-letter "word" like ANP and AMZ etc — user3452868, Apr 21 '23 at 17:52
Once you've got rid of the extra spaces, an easy way to do what you want is `sort -u file.txt`. — pjh, Apr 21 '23 at 18:19

markp-fuso · Answer 1 · 2023-04-21T18:22:32.633

If the data is already ordered such that duplicate strings reside on successive lines (as in the example), and assuming all lines contain no white space:

$ uniq file2.txt
ANE
AHL
ANI
ANJ
ANK
ANL
ANM
ANN
ANO
ANP
ANQ
ANR
AMY
AMZ

Assuming the duplicates may not be on successive lines, assuming all lines contain no white space:

$ sort -u file2.txt
AHL
AMY
AMZ
ANE
ANI
ANJ
ANK
ANL
ANM
ANN
ANO
ANP
ANQ
ANR
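
Both commands above assume the stray white space is already gone. If it is not, one possible approach is to strip trailing white space first and then deduplicate. The sketch below is only an illustration: the temporary file and its four-string sample are made up for the demo, and it assumes the unwanted white space is always at the end of a line.

```shell
# Hypothetical sample: the first line of each pair carries a trailing space.
tmp=$(mktemp)
printf 'ANE \nANE\nAHL \nAHL\nANI \nANI\n' > "$tmp"

# Strip trailing white space, then collapse adjacent duplicates (input order kept).
deduped=$(sed 's/[[:space:]]*$//' "$tmp" | uniq)

# Same cleanup, but sort so that duplicates do not need to be adjacent.
sorted=$(sed 's/[[:space:]]*$//' "$tmp" | LC_ALL=C sort -u)

printf '%s\n' "$deduped"
printf '%s\n' "$sorted"
rm -f "$tmp"
```

The uniq variant keeps the original order (ANE, AHL, ANI) while sort -u reorders the lines (AHL, ANE, ANI), which mirrors the two outputs shown above.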

Now, if the duplicates are not located on successive lines and/or white space may exist in various lines, we'll look at some ideas to address OP's current awk code ...

The provided sample includes trailing spaces on some lines so your awk code ...

awk '{!seen[$0]++};END{for(i in seen) if(seen[i]==1)print i}'

... which references the entire line ($0) is going to treat "ABC " (with a trailing space) and "ABC" as different strings.
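
Trailing spaces are easy to miss on screen. One way to make them visible, sketched here on a made-up two-pair sample rather than OP's actual file, is sed's l command, which marks the end of every line with a $:

```shell
# Hypothetical sample mimicking the question: every first line of a pair ends in a space.
tmp=$(mktemp)
printf 'ANE \nANE\nAHL \nAHL\n' > "$tmp"

# 'l' prints each line in an unambiguous form and appends '$' at the line end,
# so a trailing space shows up as "ANE $" instead of "ANE$".
visible=$(sed -n 'l' "$tmp")

printf '%s\n' "$visible"
rm -f "$tmp"
```

Lines that look identical in the terminal then show up as ANE $ and ANE$, which is why $0 treats them as two different keys.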

Assuming each line has only a single string, the current code should replace $0 with $1 to strip off the unwanted spaces (awk's default field splitting ignores leading and trailing white space), e.g.:

awk '{!seen[$1]++};END{for(i in seen) if(seen[i]==1)print i}'

But this still isn't sufficient because it's looking for only those strings that show up just once (seen[i] == 1); to print a unique list of strings consider:

awk '{!seen[$1]++};END{for(i in seen) print i}'
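
To see the difference between the two END blocks, here is a small sketch on a made-up sample in which every string occurs twice. The sample data and the trailing sort are assumptions added for the demo; the sort is there only because the order of a for (i in seen) loop is unspecified.

```shell
# Hypothetical sample: each string appears twice, once with a trailing space.
tmp=$(mktemp)
printf 'ANE \nANE\nAHL \nAHL\n' > "$tmp"

# Strings counted exactly once: nothing qualifies, since every string occurs twice.
once=$(awk '{seen[$1]++} END{for(i in seen) if(seen[i]==1) print i}' "$tmp")

# Every distinct string, whatever its count; sorted for a stable order.
distinct=$(awk '{seen[$1]++} END{for(i in seen) print i}' "$tmp" | LC_ALL=C sort)

printf 'once: [%s]\n' "$once"
printf '%s\n' "$distinct"
rm -f "$tmp"
```

The same distinction applies to the sort/uniq attempt in the question: uniq -u prints only the lines that are not repeated, whereas plain uniq (or sort -u) prints one copy of every line.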

But if we just need a unique set of array indices then the 'not' (!) and increment (++) are superfluous, so we could further reduce this to:

awk '{seen[$1]};END{for(i in seen) print i}'

Alternatively, we could keep the 'not' (!) and increment (++) and eliminate the END{} block; instead we'll print a string the first time we see it and then ignore it for the rest of the script. Unlike the for (i in seen) loops above, whose output order is unspecified, this keeps the strings in order of first appearance:

awk '!seen[$1]++' file2.txt

This generates:

ANE
AHL
ANI
ANJ
ANK
ANL
ANM
ANN
ANO
ANP
ANQ
ANR
AMY
AMZ
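
One caveat, sketched on a made-up sample: with no action block, awk prints the whole line ($0), so if the input still contains the trailing spaces, the first occurrence is printed with its space intact. Adding an explicit {print $1} prints the trimmed field instead. The sample file here is an assumption for the demo, not OP's data.

```shell
# Hypothetical sample: the first occurrence of each string has a trailing space.
tmp=$(mktemp)
printf 'ANE \nANE\nAHL \nAHL\n' > "$tmp"

# Pattern only: prints $0 for the first occurrence, trailing space included.
raw=$(awk '!seen[$1]++' "$tmp")

# Pattern plus action: prints the white-space-trimmed first field.
clean=$(awk '!seen[$1]++ {print $1}' "$tmp")

printf '[%s]\n' "$raw"
printf '[%s]\n' "$clean"
rm -f "$tmp"
```

Both forms keep the order of first appearance; only the second guarantees clean output when the input has not been cleaned up beforehand.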

OMG! I can't believe I overlooked the unintentional spaces after some of those 3-letter words! Thank you so much for pointing that out. You're right, this was the issue. Thank you for the thorough answer! — user3452868, Apr 21 '23 at 17:59
