118

The UNIX sort command can sort a very large file like this:

sort large_file

How is the sort algorithm implemented?

How come it does not cause excessive consumption of memory?

kojiro
yjfuk
  • This is interesting. I don't really know how it works, but I have a guess. It probably puts the first character of each key into a binary tree, and when there is a collision, it uses the next character of the key also, so it doesn't save more of the key than it needs to. It may then save an offset into the file with each key so it can seek back and print each line in order. – Zifre May 30 '09 at 16:28
  • Actually, @ayaz it's more interesting if you aren't sorting a file on disk but rather in a pipe since it makes it obvious that you can't simply do multiple passes over the input data. – tvanfosson May 30 '09 at 16:31
  • Why does everyone on SO feel so impelled to guess all the time? –  May 30 '09 at 16:31
  • You can do multiple passes on the input - you just need to read all the input, write it to disk, and then sort the disk file. –  May 30 '09 at 16:32
  • @Neil - from the context it seemed obvious that he was trying to sort the contents of the file not the file name (which for one name is meaningless). I just wanted to improve the question without changing the context too much so that it would get answers instead of downvotes because of a simple mistake. – tvanfosson May 30 '09 at 16:41
  • @Neil -- my point was that using the pipe makes it obvious that you don't have access to the original file and naive implementations that make multiple passes over the input data won't work. That makes the question (and implementation) more interesting. – tvanfosson May 30 '09 at 16:44
  • @tvanfosson this is indeed a mistake, I'm very sorry for it – yjfuk May 31 '09 at 01:27
  • http://unix.stackexchange.com/questions/120096/how-to-sort-big-files – Ciro Santilli OurBigBook.com Nov 22 '15 at 10:02

8 Answers

124

The article "Algorithmic Details of UNIX Sort Command" says that UNIX sort uses an external R-way merge sort. The link goes into more detail, but in essence it divides the input into smaller portions that fit into memory, sorts each portion, and then merges the sorted portions together at the end.
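The same divide, sort and merge idea can be sketched by hand with standard tools, which is roughly what the shell-script answers below automate. This is only an illustration of the principle, not what sort does internally; GNU split/sort are assumed, and the chunk size and file names are examples:

split -l 1000000 large_file chunk.            # pieces small enough to sort in memory
for c in chunk.*; do sort "$c" -o "$c"; done  # sort each piece independently
sort -m chunk.* > large_file.sorted           # merge the sorted pieces (the R-way merge step)
rm chunk.*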

Matthew
52

The sort command stores working data in temporary disk files (usually in /tmp).
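If /tmp is too small for those intermediate files, you can point sort somewhere else. Both forms below are standard GNU sort behavior; the directory name is just an example:

sort -T /bigdisk/tmp large_file > large_file.sorted      # -T selects the temp directory
TMPDIR=/bigdisk/tmp sort large_file > large_file.sorted  # sort also honors $TMPDIR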

user1686
12

#!/bin/bash

usage ()
{
    echo Parallel sort
    echo usage: psort file1 file2
    echo Sorts text file file1 and stores the output in file2
}

# test if we have two arguments on the command line
if [ $# != 2 ]
then
    usage
    exit
fi

pv "$1" | parallel --pipe --files sort -S512M | parallel -Xj1 sort -S1024M -m {} ';' rm {} > "$2"
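For context (not part of the original answer): pv shows a progress bar while feeding the file, the first parallel splits stdin into blocks and sorts each block into a temporary file, and the second parallel merge-sorts those files and then deletes them. With pv and GNU parallel installed, an example invocation would be:

./psort large_file large_file.sorted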
Taz
Sergio
  • This is excellent. Wasn't aware that there was a parallel package ! Sort time improved by more that 50% after using the above. Thanks. – xbsd Jul 14 '13 at 00:14
  • I tried to use comm for diff on the files generated by this and its giving me warning that files are not sorted. – ashishb Mar 01 '14 at 01:56
12

WARNING: This script starts one shell per chunk; for really large files, this could be hundreds. One way to cap the number of concurrent sorts is sketched after the script.

Here is a script I wrote for this purpose. On a 4-processor machine it improved sort performance by 100%!

#! /bin/ksh

MAX_LINES_PER_CHUNK=1000000
ORIGINAL_FILE=$1
SORTED_FILE=$2
CHUNK_FILE_PREFIX=$ORIGINAL_FILE.split.
SORTED_CHUNK_FILES=$CHUNK_FILE_PREFIX*.sorted

usage ()
{
     echo Parallel sort
     echo usage: psort file1 file2
     echo Sorts text file file1 and stores the output in file2
     echo Note: file1 will be split in chunks up to $MAX_LINES_PER_CHUNK lines
     echo  and each chunk will be sorted in parallel
}

# test if we have two arguments on the command line
if [ $# != 2 ]
then
    usage
    exit
fi

# Clean up any leftover files
rm -f $SORTED_CHUNK_FILES > /dev/null
rm -f $CHUNK_FILE_PREFIX* > /dev/null
rm -f $SORTED_FILE

# Split $ORIGINAL_FILE into chunks ...
split -l $MAX_LINES_PER_CHUNK $ORIGINAL_FILE $CHUNK_FILE_PREFIX

for file in $CHUNK_FILE_PREFIX*
do
    sort "$file" > "$file.sorted" &
done
wait

# Merge the sorted chunks into $SORTED_FILE ...
sort -m $SORTED_CHUNK_FILES > $SORTED_FILE

# Clean up any leftover files
rm -f $SORTED_CHUNK_FILES > /dev/null
rm -f $CHUNK_FILE_PREFIX* > /dev/null
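One way to address the warning above and cap the number of concurrent sorts is to wait after every batch, as in the following sketch (not part of the original script; NCPU=4 is an example value):

NCPU=4
i=0
for file in $CHUNK_FILE_PREFIX*
do
    sort "$file" > "$file.sorted" &
    i=$((i + 1))
    test $((i % NCPU)) -eq 0 && wait   # pause until the current batch of sorts finishes
done
wait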

See also: "Sorting large files faster with a shell script"

Alexis Wilke
Adrian
11

I'm not familiar with the program, but I guess it is done by means of external sorting: most of the data is held in temporary files, while a relatively small part is held in memory at a time. See Donald Knuth's The Art of Computer Programming, Vol. 3: Sorting and Searching, Section 5.4 for a very in-depth discussion of the subject.

pico
7

Look carefully at the options of sort to speed up performance, and understand their impact on your machine and problem. Key parameters on Ubuntu are:

  • Location of temporary files: -T directory_name
  • Amount of memory to use: -S N% (N% of all memory to use; the more the better, but avoid oversubscription, which causes swapping to disk. You can use it like "-S 80%" to use 80% of available RAM, or "-S 2G" for 2 GB of RAM.)

The questioner asks "Why no high memory usage?" The answer comes from history: older UNIX machines were small, so the default memory size was set small. Adjust it to be as big as possible for your workload to vastly improve sort performance. Set the working directory to a place on your fastest device that has enough space to hold at least 1.25 × the size of the file being sorted.
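Putting those options together on one command line (GNU coreutils sort assumed; the buffer size, temp directory, and thread count are examples, not recommendations; newer versions also accept --parallel):

sort -S 8G -T /fast_ssd/sorttmp --parallel=4 -o large_file.sorted large_file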

Fred Gannett
  • trying this out on a 2.5GB file, on a box with 64GB of RAM with -S 80%, it is actually using that full percentage, even though the entire file is smaller than that. why is that? even if it doesn't use an in-place sort that seems gratuitous – Joseph Garvin Jan 04 '16 at 21:36
  • Probably sort -S pre-allocates the memory for the sort process before even reading the contents of file. – Fred Gannett Oct 16 '17 at 10:07
-2

How to use the -T option when sorting a large file

I had to sort a large file on its 7th column.

I was using:

grep vdd "file name" | sort -nk 7

I got this error:

sort: write failed: /tmp/sort1hc37c: No space left on device

/tmp had run out of room for sort's temporary files, so I pointed -T at a directory with enough free space and it worked:

grep vdd "file name" | sort -nk 7 -T /path/with/enough/space
32cupo
-3

Memory should not be a problem - sort already takes care of that. If you want to make optimal use of your multi-core CPU, I have implemented this in a small script (similar to some you might find on the net, but simpler/cleaner than most of those ;)).

#!/bin/bash
# Usage: psort filename <chunksize> <threads>
# In this example the file largefile.txt is split into chunks of 20 MB.
# The parts are sorted in 4 simultaneous threads before getting merged.
#
# psort largefile.txt 20m 4
#
# by h.p.
split -C "$2" "$1" "$1.part"           # GNU split -C keeps lines intact (-b may cut a line in half)
suffix=sorttemp.$(date +%s)
nthreads=$3
i=0
for fname in "$1".part*
do
    let i++
    sort "$fname" > "$fname.$suffix" &
    test $(($i % $nthreads)) -eq 0 && wait   # wait after every $nthreads background sorts
done
wait
sort -m "$1".part*."$suffix"           # merged result goes to stdout
rm "$1".part*