
I can remove duplicate entries from small text files, but not large text files.
I have a file that's 4MB.
The beginning of the file looks like this:

aa
aah
aahed
aahed
aahing
aahing
aahs
aahs
aal
aalii
aalii
aaliis
aaliis
...

I want to remove the duplicates.
For example, "aahed" shows up twice, and I would only like it to show up once.

No matter what one-liner I've tried, the big list will not change.

If I type: sort big_list.txt | uniq | less
I see:

aa
aah
aahed
aahed   <-- didn't get rid of it
aahing
aahing   <-- didn't get rid of it
aahs
aahs   <-- didn't get rid of it
aal
...

However, if I copy a small chunk of words from the top of this text file and re-run the command on the small chunk of data, it does what's expected.

Are these programs refusing to sort because the file is too big? I didn't think 4MB was very big. It doesn't output a warning or anything.

I quickly wrote my own "uniq" program, and it has the same behavior. It works on a small subset of the list, but doesn't do anything to the 4MB text file. What's my issue?

EDIT: Here is a hex dump:

00000000  61 61 0a 61 61 68 0a 61  61 68 65 64 0a 61 61 68  |aa.aah.aahed.aah|
00000010  65 64 0d 0a 61 61 68 69  6e 67 0a 61 61 68 69 6e  |ed..aahing.aahin|
00000020  67 0d 0a 61 61 68 73 0a  61 61 68 73 0d 0a 61 61  |g..aahs.aahs..aa|
00000030  6c 0a 61 61 6c 69 69 0a  61 61 6c 69 69 0d 0a 61  |l.aalii.aalii..a|
00000040  61 6c 69 69 73 0a 61 61  6c 69 69 73 0d 0a 61 61  |aliis.aaliis..aa|

61 61 68 65 64 0a
a  a  h  e  d  \n

61 61 68 65 64 0d 0a
a  a  h  e  d  \r \n
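A quicker way to spot stray carriage returns than a full hex dump is `cat -A` (GNU coreutils), which prints `$` at each line end and shows `\r` as `^M`. A minimal sketch, using a made-up temp file rather than the original list:

```shell
# Recreate the mixed line endings from the dump above
# (/tmp/mixed.txt is a hypothetical name, not the real file).
printf 'aahed\naahed\r\n' > /tmp/mixed.txt

# GNU `cat -A` marks line ends with $ and carriage returns as ^M;
# on BSD/macOS, `cat -vet` or `od -c` does the same job.
cat -A /tmp/mixed.txt
# aahed$
# aahed^M$
```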

Solved: Different line delimiters

Trevor Hickey
    Are you sure there isn't any trailing whitespace or other invisible characters on some of those lines? `uniq` shouldn't care how large the file is, since (due to its requirement that the file already be sorted) it only needs to store a couple of lines in memory at a time. – rra Mar 19 '13 at 07:34
  • 1
    Is it possible the lines differ in a way that it's not obvious? For example in white space or line separator char – Joni Mar 19 '13 at 07:35
  • Nothing obvious, but I suppose that may be the case. However, when I paste a bit of it into a different file and do it again, it works. Wouldn't I have pasted those characters too? Maybe you're right, though, and it's some strange invisible ASCII character that my clipboard doesn't pick up... – Trevor Hickey Mar 19 '13 at 07:40
  • 1
    Perhaps lines differ in trailing spaces? – Axel Mar 19 '13 at 07:41
  • Maybe you can use the `head` command to test whether it works on the first few lines only? – Larry Mar 19 '13 at 07:47
  • @Xploit could you write it up as an answer and accept that? This will help others in the future – sehe Mar 19 '13 at 08:06

4 Answers


The sort(1) command accepts a -u option for uniqueness of key.

Just use

 sort -u big_list.txt
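For instance, on a small sample with uniform LF endings (the words here are made up to mirror the question), it behaves as expected:

```shell
# sort -u sorts the input and keeps one copy of each distinct line.
printf 'aahed\naah\naahed\naahs\n' | sort -u
# aah
# aahed
# aahs
```

Note that `sort` compares lines byte for byte, so `aahed` and `aahed\r` are still distinct lines; this alone will not collapse the CRLF duplicates in the question.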
Basile Starynkevitch
  • Should work, yes, but I still see duplicates when I pipe that into `less`. But if I take the chunk I see in the output of `less`, paste it into a different file, and run the same commands on that file, it removes the duplicates. – Trevor Hickey Mar 19 '13 at 07:35
  • 2
    Pipe it into a file and use a hexeditor for comparing the allegedly duplicated lines. – scai Mar 19 '13 at 07:36
  • 3
    @scai In hex, 0a vs. 0d 0a were the different line delimiters.. *sigh*, thank you. – Trevor Hickey Mar 19 '13 at 07:48

You can normalize line delimiters (convert CR+LF to LF):

sed 's/\r//' big_list.txt | sort -u
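A small end-to-end check of this pipeline (the file name and contents are made up to mirror the question's mixed endings):

```shell
# Two words, each duplicated once with a CRLF copy.
printf 'aahed\naahed\r\naahs\naahs\r\n' > /tmp/big_list.txt

# Strip the \r first so the CRLF copies compare equal, then dedupe.
# Note: \r in sed is a GNU extension; `tr -d '\r'` is a portable alternative.
sed 's/\r//' /tmp/big_list.txt | sort -u
# aahed
# aahs
```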
max taldykin

To answer max taldykin's question about awk '!_[$0]++' file:

awk '!_[$0]++' file is the same as

awk '!seen[$0]++' file

, which is the same as

awk '!seen[$0]++ { print; }' file

, which means

awk '
    {
        if (!seen[$0]) {
            print;
        }
        seen[$0]++;
    }' file

Important points here:

  1. $0 means the current record which usually is the current line
  2. In awk, the ACTION part is optional and the default action is { print; }
  3. In arithmetic context, an uninitialized var is 0
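One property worth noting: unlike `sort -u`, this awk filter needs no sorted input and keeps the first occurrence of each line in its original order. A quick demonstration:

```shell
# Duplicates are dropped; first-seen order is preserved
# (sort -u would print "a" before "b" here).
printf 'b\na\nb\na\n' | awk '!seen[$0]++'
# b
# a
```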
pynexj
  • This trick is often used in scripting languages which support *hash* or *associative arrays*. Here is [an example with Perl](http://www.perlmonks.org/?node=How%20can%20I%20extract%20just%20the%20unique%20elements%20of%20an%20array%3F). – pynexj Mar 20 '13 at 03:13

Apart from `sort -u`, you can also use `awk '!_[$0]++' yourfile`.

user1939168