I am currently doing a UNIX sort (via Git Bash on a Windows machine) of a 500 GB text file. Because I ran out of space on the main disk, I used the -T option to direct the temp files to a disk where I have enough space to accommodate the entire file. The thing is, I've been watching the disk space, and apparently the temp files already exceed the size of the original file. I don't know how much further this will go, but I'm wondering if there is a rule by which I can predict how much space I will need for temp files.
-
If the Unix sort works similarly to the GNU sort, then the initial pass creates temp files based on RAM size. Assuming there's 1 GB of RAM that can be used for sorting in memory, it would create 500 1 GB files, then do repeated 16-way merges on those files. Also assuming that it deletes files after each merge, it would need 516 GB of space, with each file rounded up to a cluster (file allocation) size boundary. On the last merge, the required disk space will be double the file size (rounded up to a cluster boundary), so a bit over 1,000 GB. – rcgldr Aug 10 '16 at 19:22
-
Rats! It looks like I'm going to have to invest in some more storage. – Stonecraft Aug 10 '16 at 19:50
-
Is that last merge also in the temp folder? I directed the output to a different location than the temp in hopes of avoiding having two complete copies of the file on one disk. – Stonecraft Aug 10 '16 at 20:34
-
I underestimated the temp file space. Assume the first merge pass creates 500 1 GB files. The next phase merges 16 1 GB files into 16 GB files, taking 516 GB of space. The next phase merges 16 16 GB files into 256 GB files, taking 756 GB of space. The last phase merges one 256 GB and one 244 GB file to create the 500 GB output file. If the output file is on another disk, then the space required on the temp disk is about 756 GB. – rcgldr Aug 11 '16 at 00:28
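To replay that arithmetic for other buffer and batch sizes, here is a rough shell sketch. It only models the estimate in the comment above (initial 1 GB runs, 16-way merges, old runs deleted only after each full pass, final merge written to another disk); it is not how GNU sort actually allocates space.

# Back-of-the-envelope model of the comment's estimate; all numbers are assumptions.
FILE_GB=500   # input size
BUF_GB=1      # size of each initial sorted run
BATCH=16      # merge fan-in (GNU sort's default --batch-size is 16)

runs=$(( (FILE_GB + BUF_GB - 1) / BUF_GB ))
run_gb=$BUF_GB
peak=$FILE_GB
# Every pass that still writes merged runs to the temp disk needs the old
# runs plus one newly written merged run at its peak.
while [ "$runs" -gt "$BATCH" ]; do
  run_gb=$(( run_gb * BATCH ))
  pass_peak=$(( FILE_GB + run_gb ))
  [ "$pass_peak" -gt "$peak" ] && peak=$pass_peak
  runs=$(( (runs + BATCH - 1) / BATCH ))
done
# The final merge writes the output file (assumed here to be on another
# disk), so it does not add to the temp area in this model.
echo "estimated peak temp space: ~${peak} GB"   # prints ~756 GB for 500/1/16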
-
Well, what ended up happening was that the 32 temp files were reduced to 16 temp files (~10 GB each), and 99 GB of the final file was written to the target disk, then my computer hung for a moment. Then I got a `sort: write failed: standard output: No space left on device`, the temp folder was empty, and the disk it was on had the 869 GB free space I started with. – Stonecraft Aug 11 '16 at 03:24
-
Is there a -S (buffer size) option? If so, try "-S 1G" to limit the buffer and initial temp file size to 1 GB. How large were those 32 temp files? – rcgldr Aug 11 '16 at 04:11
-
Unfortunately, I didn't think to look until it had gotten down to 16. – Stonecraft Aug 11 '16 at 07:25
-
Try --batch-size=2. That will take longer, but reduce disk space. – rcgldr Aug 11 '16 at 07:46
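Putting those suggestions together, a possible invocation (assuming GNU sort; the paths and file name below are placeholders, not taken from the question) would be:

# -S caps the in-memory buffer, -T points temp files at the roomy disk, and
# --batch-size lowers the merge fan-in to trade speed for temp space.
# /e/sort-tmp, input.txt and /f/sorted.txt are placeholder paths.
sort -S 1G -T /e/sort-tmp --batch-size=2 input.txt > /f/sorted.txt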
1 Answer
I'd batch it manually as described in this unix.SE answer.
Find some very basic queries that will divide your content into chunks that are small enough to be sorted. For example, if it's a file of words, you could create queries like `grep ^a …`, `grep ^b …`, and so on. Some items may need more granularity than others.
You can script that like:
#!/bin/bash
# Split the input into buckets by first character, sort each bucket,
# and append it to the (already globally ordered) output.
for char1 in other {0..9} {a..z}; do
  out="/tmp/sort.$char1.xz"
  echo "Extracting lines starting with '$char1'"
  if [ "$char1" = "other" ]; then char1='[^a-z0-9]'; fi
  grep -i "^$char1" *.txt | xz -c0 > "$out"    # compressed bucket on the temp disk
  unxz -c "$out" | sort -u >> output.txt || exit 1
  rm "$out"
done
echo "It worked"
I'm using `xz -0` because it's almost as fast as gzip's default (`gzip -6`) yet vastly better at conserving space. I omitted compression from the final output in order to preserve the exit value of `sort -u`, but you could instead use a size check (IIRC, sort fails with zero output) and then use `sort -u | xz -c0 >> output.txt.xz`, since the xz (and gzip) container lets you concatenate archives (I've written about that before too).
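As a quick standalone illustration of that concatenation property (not part of the script above): two independently compressed xz streams appended to the same file decompress as one stream.

# parts.xz is a throwaway file name for this demonstration only.
printf 'alpha\n' | xz -c0 >  parts.xz
printf 'beta\n'  | xz -c0 >> parts.xz
unxz -c parts.xz    # prints "alpha" then "beta" as a single stream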
This works because the output of each grep run is already sorted (`0` is before `1`, which is before `a`, etc.), so the final assembly doesn't need to run through `sort`. (Note: the "other" section will be slightly different, since some non-alphanumeric characters sort before the numbers, others fall between numbers and letters, and others still come after the letters. You can also remove grep's `-i` flag and additionally iterate through `{A..Z}` to be case-sensitive.) Each individual iteration obviously still needs to be sorted, but hopefully they're manageable.
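A sketch of that case-sensitive variant (same placeholder *.txt inputs and output.txt as in the script above; it assumes the C/ASCII collation, where digits sort before uppercase and uppercase before lowercase, and the "other" bucket keeps the same ordering caveat):

export LC_ALL=C    # byte-order collation so the concatenated buckets stay globally sorted
for char1 in other {0..9} {A..Z} {a..z}; do
  if [ "$char1" = "other" ]; then char1='[^A-Za-z0-9]'; fi
  grep "^$char1" *.txt | sort -u >> output.txt || exit 1
done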
If the program exits before completing all iterations and saying "It worked", you can edit the script to use a finer-grained batch for the last iteration it tried. Remove all prior iterations, since they're already safely saved in output.txt.
