118

The UNIX sort command can sort a very large file like this:

sort large_file

How is the sort algorithm implemented?

How come it does not cause excessive consumption of memory?

kojiro
yjfuk
  • This is interesting. I don't really know how it works, but I have a guess. It probably puts the first character of each key into a binary tree, and when there is a collision, it uses the next character of the key also, so it doesn't save more of the key than it needs to. It may then save an offset into the file with each key so it can seek back and print each line in order. – Zifre May 30 '09 at 16:28
  • Actually, @ayaz it's more interesting if you aren't sorting a file on disk but rather in a pipe since it makes it obvious that you can't simply do multiple passes over the input data. – tvanfosson May 30 '09 at 16:31
  • Why does everyone on SO feel so impelled to guess all the time? –  May 30 '09 at 16:31
  • You can do multiple passes on the input - you just need to read all the input, write it to disk, and then sort the disk file. –  May 30 '09 at 16:32
  • @Neil - from the context it seemed obvious that he was trying to sort the contents of the file not the file name (which for one name is meaningless). I just wanted to improve the question without changing the context too much so that it would get answers instead of downvotes because of a simple mistake. – tvanfosson May 30 '09 at 16:41
  • @Neil -- my point was that using the pipe makes it obvious that you don't have access to the original file and naive implementations that make multiple passes over the input data won't work. That makes the question (and implementation) more interesting. – tvanfosson May 30 '09 at 16:44
  • @tvanfosson this is indeed a mistake, I'm very sorry for it – yjfuk May 31 '09 at 01:27
  • http://unix.stackexchange.com/questions/120096/how-to-sort-big-files – Ciro Santilli OurBigBook.com Nov 22 '15 at 10:02

8 Answers

124

The article "Algorithmic Details of UNIX Sort Command" says that UNIX sort uses an external R-way merge sort. The link goes into more detail, but in essence it divides the input into smaller portions that fit into memory, sorts each portion, and then merges the sorted portions together at the end.
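The same divide, sort and merge idea can be sketched by hand with standard tools, which is roughly what the shell-script answers below automate. This is only an illustration of the principle, not what sort does internally; GNU split/sort are assumed, and the chunk size and file names are examples:

split -l 1000000 large_file chunk.            # pieces small enough to sort in memory
for c in chunk.*; do sort "$c" -o "$c"; done  # sort each piece independently
sort -m chunk.* > large_file.sorted           # merge the sorted pieces (the R-way merge step)
rm chunk.*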

Matthew
52

The sort command stores working data in temporary disk files (usually in /tmp).
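If /tmp is too small for those intermediate files, you can point sort somewhere else. Both forms below are standard GNU sort behavior; the directory name is just an example:

sort -T /bigdisk/tmp large_file > large_file.sorted      # -T selects the temp directory
TMPDIR=/bigdisk/tmp sort large_file > large_file.sorted  # sort also honors $TMPDIR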

user1686
12

#!/bin/bash

usage ()
{
    echo Parallel sort
    echo usage: psort file1 file2
    echo Sorts text file file1 and stores the output in file2
}

# test if we have two arguments on the command line
if [ $# != 2 ]
then
    usage
    exit
fi

pv "$1" | parallel --pipe --files sort -S512M | parallel -Xj1 sort -S1024M -m {} ';' rm {} > "$2"
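For context (not part of the original answer): pv shows a progress bar while feeding the file, the first parallel splits stdin into blocks and sorts each block into a temporary file, and the second parallel merge-sorts those files and then deletes them. With pv and GNU parallel installed, an example invocation would be:

./psort large_file large_file.sorted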
Taz
Sergio
  • This is excellent. Wasn't aware that there was a parallel package ! Sort time improved by more that 50% after using the above. Thanks. – xbsd Jul 14 '13 at 00:14
  • I tried to use comm for diff on the files generated by this and its giving me warning that files are not sorted. – ashishb Mar 01 '14 at 01:56
12

WARNING: This script starts one shell per chunk; for really large files, this could be hundreds. One way to cap the number of concurrent sorts is sketched after the script.

Here is a script I wrote for this purpose. On a 4-processor machine it improved sort performance by 100%!

#! /bin/ksh

MAX_LINES_PER_CHUNK=1000000
ORIGINAL_FILE=$1
SORTED_FILE=$2
CHUNK_FILE_PREFIX=$ORIGINAL_FILE.split.
SORTED_CHUNK_FILES=$CHUNK_FILE_PREFIX*.sorted

usage ()
{
     echo Parallel sort
     echo usage: psort file1 file2
     echo Sorts text file file1 and stores the output in file2
     echo Note: file1 will be split in chunks up to $MAX_LINES_PER_CHUNK lines
     echo  and each chunk will be sorted in parallel
}

# test if we have two arguments on the command line
if [ $# != 2 ]
then
    usage
    exit
fi

# Clean up any leftover files
rm -f $SORTED_CHUNK_FILES > /dev/null
rm -f $CHUNK_FILE_PREFIX* > /dev/null
rm -f $SORTED_FILE

# Split $ORIGINAL_FILE into chunks ...
split -l $MAX_LINES_PER_CHUNK $ORIGINAL_FILE $CHUNK_FILE_PREFIX

for file in $CHUNK_FILE_PREFIX*
do
    sort "$file" > "$file.sorted" &
done
wait

# Merge the sorted chunks into $SORTED_FILE ...
sort -m $SORTED_CHUNK_FILES > $SORTED_FILE

# Clean up any leftover files
rm -f $SORTED_CHUNK_FILES > /dev/null
rm -f $CHUNK_FILE_PREFIX* > /dev/null
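One way to address the warning above and cap the number of concurrent sorts is to wait after every batch, as in the following sketch (not part of the original script; NCPU=4 is an example value):

NCPU=4
i=0
for file in $CHUNK_FILE_PREFIX*
do
    sort "$file" > "$file.sorted" &
    i=$((i + 1))
    test $((i % NCPU)) -eq 0 && wait   # pause until the current batch of sorts finishes
done
wait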

See also: "Sorting large files faster with a shell script"

Alexis Wilke
Adrian
11

I'm not familiar with the program, but I guess it is done by means of external sorting: most of the data is held in temporary files, while a relatively small part is held in memory at a time. See Donald Knuth's The Art of Computer Programming, Vol. 3: Sorting and Searching, Section 5.4 for a very in-depth discussion of the subject.

pico
7

Look carefully at the options of sort to speed up performance, and understand their impact on your machine and problem. Key parameters on Ubuntu are:

  • Location of temporary files: -T directory_name
  • Amount of memory to use: -S N% (N% of all memory to use; the more the better, but avoid oversubscription, which causes swapping to disk. You can use it like "-S 80%" to use 80% of available RAM, or "-S 2G" for 2 GB of RAM.)

The questioner asks "Why no high memory usage?" The answer comes from history: older UNIX machines were small, so the default memory size was set small. Adjust it to be as big as possible for your workload to vastly improve sort performance. Set the working directory to a place on your fastest device that has enough space to hold at least 1.25 × the size of the file being sorted.
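Putting those options together on one command line (GNU coreutils sort assumed; the buffer size, temp directory, and thread count are examples, not recommendations; newer versions also accept --parallel):

sort -S 8G -T /fast_ssd/sorttmp --parallel=4 -o large_file.sorted large_file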

Fred Gannett
  • trying this out on a 2.5GB file, on a box with 64GB of RAM with -S 80%, it is actually using that full percentage, even though the entire file is smaller than that. why is that? even if it doesn't use an in-place sort that seems gratuitous – Joseph Garvin Jan 04 '16 at 21:36
  • Probably sort -S pre-allocates the memory for the sort process before even reading the contents of file. – Fred Gannett Oct 16 '17 at 10:07
-2

How to use the -T option when sorting a large file

I had to sort a large file on its 7th column.

I was using:

grep vdd "file name" | sort -nk 7

I got this error:

sort: write failed: /tmp/sort1hc37c: No space left on device

/tmp had run out of room for sort's temporary files, so I pointed -T at a directory with enough free space and it worked:

grep vdd "file name" | sort -nk 7 -T /path/with/enough/space
32cupo
-3

Memory should not be a problem - sort already takes care of that. If you want to make optimal use of your multi-core CPU, I have implemented this in a small script (similar to some you might find on the net, but simpler/cleaner than most of those ;)).

#!/bin/bash
# Usage: psort filename <chunksize> <threads>
# In this example the file largefile.txt is split into chunks of 20 MB.
# The parts are sorted in 4 simultaneous threads before getting merged.
#
# psort largefile.txt 20m 4
#
# by h.p.
split -C "$2" "$1" "$1.part"           # GNU split -C keeps lines intact (-b may cut a line in half)
suffix=sorttemp.$(date +%s)
nthreads=$3
i=0
for fname in "$1".part*
do
    let i++
    sort "$fname" > "$fname.$suffix" &
    test $(($i % $nthreads)) -eq 0 && wait   # wait after every $nthreads background sorts
done
wait
sort -m "$1".part*."$suffix"           # merged result goes to stdout
rm "$1".part*