
I can remove duplicate entries from small text files, but not large text files.
I have a file that's 4MB.
The beginning of the file looks like this:

aa
aah
aahed
aahed
aahing
aahing
aahs
aahs
aal
aalii
aalii
aaliis
aaliis
...

I want to remove the duplicates.
For example, "aahed" shows up twice, and I would only like it to show up once.

No matter what one-liner I've tried, the big list will not change.

If I type: sort big_list.txt | uniq | less
I see:

aa
aah
aahed
aahed   <-- didn't get rid of it
aahing
aahing   <-- didn't get rid of it
aahs
aahs   <-- didn't get rid of it
aal
...

However, if I copy a small chunk of words from the top of this text file and re-run the command on the small chunk of data, it does what's expected.

Are these programs refusing to sort because the file is too big? I didn't think 4MB was very big. It doesn't output a warning or anything.

I quickly wrote my own "uniq" program, and it has the same behavior. It works on a small subset of the list, but doesn't do anything to the 4MB text file. What's my issue?

EDIT: Here is a hex dump:

00000000  61 61 0a 61 61 68 0a 61  61 68 65 64 0a 61 61 68  |aa.aah.aahed.aah|
00000010  65 64 0d 0a 61 61 68 69  6e 67 0a 61 61 68 69 6e  |ed..aahing.aahin|
00000020  67 0d 0a 61 61 68 73 0a  61 61 68 73 0d 0a 61 61  |g..aahs.aahs..aa|
00000030  6c 0a 61 61 6c 69 69 0a  61 61 6c 69 69 0d 0a 61  |l.aalii.aalii..a|
00000040  61 6c 69 69 73 0a 61 61  6c 69 69 73 0d 0a 61 61  |aliis.aaliis..aa|

61 61 68 65 64 0a
a  a  h  e  d  \n

61 61 68 65 64 0d 0a
a  a  h  e  d  \r \n
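A quicker way to spot stray carriage returns than a full hex dump is `cat -A` (GNU coreutils), which prints `$` at each line end and shows `\r` as `^M`. A minimal sketch, using a made-up temp file rather than the original list:

```shell
# Recreate the mixed line endings from the dump above
# (/tmp/mixed.txt is a hypothetical name, not the real file).
printf 'aahed\naahed\r\n' > /tmp/mixed.txt

# GNU `cat -A` marks line ends with $ and carriage returns as ^M;
# on BSD/macOS, `cat -vet` or `od -c` does the same job.
cat -A /tmp/mixed.txt
# aahed$
# aahed^M$
```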

Solved: Different line delimiters

Trevor Hickey
    Are you sure there isn't any trailing whitespace or other invisible characters on some of those lines? `uniq` shouldn't care how large the file is, since (due to its requirement that the file already be sorted) it only needs to store a couple of lines in memory at a time. – rra Mar 19 '13 at 07:34
  • 1
    Is it possible the lines differ in a way that it's not obvious? For example in white space or line separator char – Joni Mar 19 '13 at 07:35
  • Nothing obvious, but I suppose that may be the case. However, when I paste a bit of it into a different file and do it again, it works. Wouldn't I have pasted those characters too? Maybe you're right, though, and it's some strange invisible ASCII character that my clipboard doesn't pick up... – Trevor Hickey Mar 19 '13 at 07:40
  • 1
    Perhaps lines differ in trailing spaces? – Axel Mar 19 '13 at 07:41
  • Maybe you can use the `head` command to test whether it works on the first few lines only? – Larry Mar 19 '13 at 07:47
  • @Xploit could you write it up as an answer and accept that? This will help others in the future – sehe Mar 19 '13 at 08:06

4 Answers


The sort(1) command accepts a -u option for uniqueness of key.

Just use

 sort -u big_list.txt
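For instance, on a small sample with uniform LF endings (the words here are made up to mirror the question), it behaves as expected:

```shell
# sort -u sorts the input and keeps one copy of each distinct line.
printf 'aahed\naah\naahed\naahs\n' | sort -u
# aah
# aahed
# aahs
```

Note that `sort` compares lines byte for byte, so `aahed` and `aahed\r` are still distinct lines; this alone will not collapse the CRLF duplicates in the question.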
Basile Starynkevitch
  • Should work, yes, but I still see duplicates when I pipe that into `less`. But if I take the chunk I see in the output of `less`, paste it into a different file, and run the same commands on that file, it removes the duplicates. – Trevor Hickey Mar 19 '13 at 07:35
  • 2
    Pipe it into a file and use a hexeditor for comparing the allegedly duplicated lines. – scai Mar 19 '13 at 07:36
  • 3
    @scai In hex, 0a vs. 0d 0a were the different line delimiters.. *sigh*, thank you. – Trevor Hickey Mar 19 '13 at 07:48

You can normalize line delimiters (convert CR+LF to LF):

sed 's/\r//' big_list.txt | sort -u
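A small end-to-end check of this pipeline (the file name and contents are made up to mirror the question's mixed endings):

```shell
# Two words, each duplicated once with a CRLF copy.
printf 'aahed\naahed\r\naahs\naahs\r\n' > /tmp/big_list.txt

# Strip the \r first so the CRLF copies compare equal, then dedupe.
# Note: \r in sed is a GNU extension; `tr -d '\r'` is a portable alternative.
sed 's/\r//' /tmp/big_list.txt | sort -u
# aahed
# aahs
```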
max taldykin

To answer max taldykin's question about awk '!_[$0]++' file:

awk '!_[$0]++' file is the same as

awk '!seen[$0]++' file

, which is the same as

awk '!seen[$0]++ { print; }' file

, which means

awk '
    {
        if (!seen[$0]) {
            print;
        }
        seen[$0]++;
    }' file

Important points here:

  1. $0 means the current record which usually is the current line
  2. In awk, the ACTION part is optional and the default action is { print; }
  3. In arithmetic context, an uninitialized var is 0
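One property worth noting: unlike `sort -u`, this awk filter needs no sorted input and keeps the first occurrence of each line in its original order. A quick demonstration:

```shell
# Duplicates are dropped; first-seen order is preserved
# (sort -u would print "a" before "b" here).
printf 'b\na\nb\na\n' | awk '!seen[$0]++'
# b
# a
```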
pynexj
  • This trick is often used in scripting languages which support *hash* or *associative arrays*. Here is [an example with Perl](http://www.perlmonks.org/?node=How%20can%20I%20extract%20just%20the%20unique%20elements%20of%20an%20array%3F). – pynexj Mar 20 '13 at 03:13

Apart from `sort -u`, you can also use `awk '!_[$0]++' yourfile`.

user1939168