113

How can I randomize the lines in a file using standard tools on Red Hat Linux?

I don't have the shuf command, so I am looking for something like a perl or awk one-liner that accomplishes the same task.

mklement0
Stuart Woodward
  • 1
    I asked almost the same question [http://stackoverflow.com/questions/286640/how-can-i-print-the-lines-in-stdin-in-random-order-in-perl] – Steve Schnepp May 25 '09 at 08:08
  • possible duplicate of [How can I shuffle the lines of a text file in Unix command line?](http://stackoverflow.com/questions/2153882/how-can-i-shuffle-the-lines-of-a-text-file-in-unix-command-line) – skolima Mar 26 '14 at 10:49
  • I consider gcc a standard tool in any linux. ;D – msb Jul 16 '18 at 20:20

11 Answers

226

Um, let's not forget

sort --random-sort
Jim T
  • Could you tell me which version of sort has this option? – Stuart Woodward May 21 '09 at 08:50
  • 1
    Well, I'm using gnu-coreutils 7.1 (standard gentoo install), which has sort with this option, not sure when it appeared, or if it's in other implementations. – Jim T May 21 '09 at 11:46
  • 1
    The feature was committed on 10th December 2005, the release following that was 5.94, so I'm guessing it's been available since that version. – Jim T May 21 '09 at 11:58
  • 41
    On OS X you can install gnu coreutils with homebrew: `brew install coreutils` All the utils are prefixed with a g so: `gsort --random-sort` or `gshuf` will work as expected – mike Aug 21 '13 at 04:14
  • 3
    +1 @mike. I use Macports and I also had `gsort` and `gshuf` installed when I did `port install coreutils` – Noah Sussman Sep 04 '13 at 03:06
  • 10
    This solution is only good if your lines do not have repetitions. If they do, all instances of that line will appear next to each other. Consider using `shuf` instead (on linux). – Ali J May 07 '14 at 20:16
  • `cat ordered.txt|sort --random-sort>shuffle.txt` solved my problem. Thanks! –  Sep 17 '14 at 15:28
  • To reiterate @AliJ's warning: _duplicate input lines_ will invariably sort _next to each other_ - this will only be a true shuffle if all input lines are unique. – mklement0 May 08 '15 at 22:03
  • cool! this also works from within vim with :%!sort --random-sort – Dalker Dec 05 '16 at 18:46
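To see why duplicate lines end up next to each other, here is a rough Python sketch of the idea the GNU docs describe (hash every line with a random salt, then sort by the hash). It is not the actual coreutils implementation: the `random_sort` helper and the choice of SHA-256 are assumptions made purely for illustration.

```python
import hashlib
import os


def random_sort(lines, salt=None):
    """Order lines by a salted hash, mimicking the idea behind sort -R."""
    # Hypothetical helper: a fresh random salt per run gives a different order each time.
    if salt is None:
        salt = os.urandom(16)
    # Identical lines get identical hashes, so they always sort next to each other.
    return sorted(lines, key=lambda line: hashlib.sha256(salt + line.encode()).digest())


if __name__ == "__main__":
    data = ["a", "b", "a", "c", "b", "a"]
    print(random_sort(data))  # the three "a" lines always come out adjacent
```

The order of the groups changes from run to run, but the members of each group of duplicates never separate, which is why this is only a true shuffle for unique lines.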
131

shuf is the best way.

sort -R is painfully slow. I just tried to sort a 5GB file. I gave up after 2.5 hours. Then shuf sorted it in a minute.

s g
Michal Illich
  • This is great. It appears to be in GNU coreutils. – ariddell Mar 27 '13 at 13:06
  • 4
    I suspect the reason `sort -R` is slow is that it computes a hash for each line. From the docs: "[Sort by hashing the input keys and then sorting the hash values.](http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html#index-random-sort-554)" – Joe Flynn Jun 13 '13 at 18:40
  • A 1.7GB file and 28 million lines on Linux Debian Wheezy was randomized in a few minutes, acceptable – Lothar Sep 05 '14 at 00:00
  • 16
    beware, `shuf` loads everything in memory. – jfs Oct 27 '14 at 07:54
  • If "sort" is slow then try to give it more memory. The reason for being slow most probably is not hashing but writing to disk in order to use less main memory. Try e.g. "sort -S 40G -R" if you have enough main memory. – benroth Apr 30 '15 at 14:56
  • 1
    @benroth: From what I can tell, with really large input counts increasing the memory can help _somewhat_, but it's still slow overall. In my tests, sorting a 1-million-line input file created with `seq -f 'line %.0f' 1000000` took the same, _long_ time to process (much, much longer than with `shuf`), no matter how much memory I allocated. – mklement0 May 08 '15 at 23:00
  • 1
    @mklement0, you are right! I just tried it with a much bigger file than what I had before, and the hashing seems to be the bottleneck indeed. – benroth May 11 '15 at 18:41
66

And a Perl one-liner you get!

perl -MList::Util -e 'print List::Util::shuffle <>'

It uses a module, but the module is part of the Perl core distribution. If that's not good enough, you may consider rolling your own.
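If you do roll your own, the usual approach is a Fisher-Yates shuffle. Here is a minimal sketch of that algorithm in Python rather than Perl; the `fisher_yates` name and its `rng` parameter are made up for this example, and I am not claiming this is how List::Util implements `shuffle` internally.

```python
import random


def fisher_yates(items, rng=random):
    """Return a shuffled copy of items using the Fisher-Yates algorithm."""
    result = list(items)
    # Walk from the last index down, swapping each slot with a
    # randomly chosen slot at or before it.
    for i in range(len(result) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        result[i], result[j] = result[j], result[i]
    return result


if __name__ == "__main__":
    print(fisher_yates(["line 1", "line 2", "line 3", "line 4"]))
```

Each element is swapped at most once per pass, so the whole thing is a single linear pass over the list.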

I tried using this with the -i flag ("edit-in-place") to have it edit the file. The documentation suggests it should work, but it doesn't. It still displays the shuffled file to stdout, but this time it deletes the original. I suggest you don't use it.

Consider a shell script:

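#!/bin/bash
# bash rather than sh, because [[ ]] is not POSIX

if [[ $# -eq 0 ]]
then
  echo "Usage: $0 [file ...]"
  exit 1
fi

for i in "$@"
do
  perl -MList::Util -e 'print List::Util::shuffle <>' "$i" > "$i.new"
  # Read via stdin so wc prints only the byte count, not the file name
  if [[ $(wc -c < "$i") -eq $(wc -c < "$i.new") ]]
  then
    mv "$i.new" "$i"
  else
    echo "Error for file $i!"
  fi
done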

Untested, but hopefully works.

Chris Lutz
  • To backup the original file, you can suffix an extension to the -i flag [http://perldoc.perl.org/perlrun.html] – Steve Schnepp May 25 '09 at 08:11
  • I'm usually a Perl fan, but came across this ruby example which has the benefit of being shorter: `ruby -e 'puts STDIN.readlines.shuffle'`. It would need testing on big inputs to see if the speed is comparable. (also works on OS X) – mivk May 17 '15 at 21:48
  • per comment below, `shuf` loads everything into memory, so it doesn't work with a truly huge file (mine is ~300GB tsv). This perl script failed on mine too, but with no error except `Killed`. Any idea if the perl solution is loading everything into memory too, or is there some other problem I'm encountering? – seth127 Jan 17 '18 at 15:05
24
cat yourfile.txt | while IFS= read -r f; do printf "%05d %s\n" "$RANDOM" "$f"; done | sort -n | cut -c7-

Read the file, prepend every line with a random number, sort the file on those random prefixes, cut the prefixes afterwards. One-liner which should work in any semi-modern shell.

EDIT: incorporated Richard Hansen's remarks.
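For anyone who finds the pipeline hard to read, the same decorate / sort / undecorate idea can be sketched in Python. This is only an illustration of the technique, not a replacement for the one-liner; `shuffle_by_random_key` is a hypothetical name.

```python
import random


def shuffle_by_random_key(lines, rng=random):
    """Shuffle lines by tagging each with a random number and sorting on the tag."""
    # Decorate: pair every line with a random number (the "prefix").
    decorated = [(rng.random(), line) for line in lines]
    # Sort on the random number only, so the line content never affects the order.
    decorated.sort(key=lambda pair: pair[0])
    # Undecorate: drop the prefix again, like the final cut.
    return [line for _, line in decorated]


if __name__ == "__main__":
    print(shuffle_by_random_key(["  indented", "back\\slash", "plain"]))
```

Because the lines are never re-parsed here, leading whitespace and backslashes survive untouched, which is the same thing `IFS= read -r` achieves in the shell version.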

ChristopheD
  • 1
    This works, and is a creative solution, but will delete leading whitespace on lines. – Chris Lutz May 20 '09 at 05:39
  • @Chris changing the last cut to |sed 's/^[^\t]*\t//' should fix that – bdonlan May 20 '09 at 05:43
  • Kudos to the simplicity of the approach! – Shashikant Kore May 20 '09 at 06:22
  • Nice try, but even changing to the sed command deletes whitespace. – Stuart Woodward May 20 '09 at 08:00
  • Try this: `cat yourfile.txt | while read f ; do printf "%05d %s\n" "$(( $RANDOM % 100000 ))" "$f"; done | sort -n | cut -c7-` – jwhitlock Oct 06 '10 at 15:19
  • +1 only one that worked for me out of the box on an old version of solaris – jasonk Jul 11 '12 at 11:48
  • +1 for portability though I went with the following to avoid the deletion of leading whitespaces: cat /path/to/file | awk 'BEGIN { srand() } { print rand() "\t" $0 }' | sort -n | cut -f2- > /path/to/random.file – CodeReaper Dec 21 '12 at 17:32
  • 3
    +1 for POSIX conformance (except for `$RANDOM`), but -1 for butchering the data. Replacing `while read f` with `while IFS= read -r f` will prevent `read` from removing leading and trailing whitespace (see [this answer](http://stackoverflow.com/a/6399568/712605)) and prevent processing of backslashes. Using a fixed-length random string will prevent `cut` from deleting leading whitespace. Result: `cat yourfile.txt | while IFS= read -r f; do printf "%05d %s\n" "$RANDOM" "$f"; done | sort -n | cut -c7-` – Richard Hansen Mar 29 '13 at 19:17
  • 3
    @Richard Hansen: Thanks, these suggested changes are obviously appropriate, I've edited my post. – ChristopheD Apr 08 '13 at 22:49
  • A clever solution, but any solution involving a `bash` loop will perform reasonably only for small input-line counts. – mklement0 May 08 '15 at 22:55
  • Very useful. One addition: "sort -n -k 1" so that the sort is only on the first added column and not the rest of the line – Rani Nelken Jul 01 '15 at 23:14
10

A one-liner for python:

python -c "import random, sys; lines = open(sys.argv[1]).readlines(); random.shuffle(lines); sys.stdout.write(''.join(lines))" myFile

And for printing just a single random line:

python -c "import random, sys; sys.stdout.write(random.choice(open(sys.argv[1]).readlines()))" myFile

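But see this post for the drawbacks of python's random.shuffle(). With more than 2080 elements, the number of possible permutations exceeds the period of the underlying random number generator, so most orderings can never be produced.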
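The 2080 figure can be checked with a few lines of arithmetic. The sketch below assumes the generator is the Mersenne Twister with period 2**19937 - 1, which is what the Python docs state for the `random` module; `max_fully_shuffleable_length` is a name invented for this example.

```python
# Period of the Mersenne Twister, the generator behind Python's random module.
MT_PERIOD = 2**19937 - 1


def max_fully_shuffleable_length(period=MT_PERIOD):
    """Largest n such that n! does not exceed the generator's period."""
    n, factorial = 1, 1
    # Keep growing n while (n + 1)! still fits within the period.
    while factorial * (n + 1) <= period:
        n += 1
        factorial *= n
    return n


if __name__ == "__main__":
    print(max_fully_shuffleable_length())  # 2080
```

Note that this is a statement about which permutations are reachable, not about the shuffle looking non-random in practice.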

Community
scai
5

Related to Jim's answer:

My ~/.bashrc contains the following:

unsort ()
{
    LC_ALL=C sort -R "$@"
}

With GNU coreutils's sort, -R = --random-sort, which generates a random hash of each line and sorts by it. The randomized hash wouldn't actually be used in some locales in some older (buggy) versions, causing it to return normal sorted output, which is why I set LC_ALL=C.


Related to Chris's answer:

perl -MList::Util=shuffle -e'print shuffle<>'

is a slightly shorter one-liner. (-Mmodule=a,b,c is shorthand for -e 'use module qw(a b c);'.)

Giving it a simple -i doesn't work for shuffling in-place because Perl expects the print to happen in the same loop in which the file is being read, and print shuffle <> doesn't output anything until after all input files have been read and closed.

As a shorter workaround,

perl -MList::Util=shuffle -i -ne'BEGIN{undef$/}print shuffle split/^/m'

will shuffle files in-place. (-n means "wrap the code in a while (<>) {...} loop"; BEGIN{undef$/} makes Perl operate on files-at-a-time instead of lines-at-a-time; and split/^/m is needed because $_=<> has been implicitly done with an entire file instead of lines.)
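The same read-everything-then-write-back idea, sketched in Python for comparison. This is an illustration under my own assumptions, not a translation of what perl -i does internally: `shuffle_file_in_place` is a hypothetical helper, and it goes through a temporary file in the same directory so that the original is only replaced once the shuffled copy is fully written.

```python
import os
import random
import tempfile


def shuffle_file_in_place(path):
    """Shuffle the lines of the file at path, replacing it in one step."""
    # Read everything first; like the Perl version, this holds the whole file in memory.
    with open(path, "r", newline="") as src:
        lines = src.readlines()
    random.shuffle(lines)
    # Write to a temporary file in the same directory, then rename it over the original.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", newline="") as dst:
        dst.writelines(lines)
    os.replace(tmp_path, path)
```

Two caveats with this sketch: the replaced file gets the temporary file's permissions rather than the original's, and if the last line of the input has no trailing newline it can end up glued to whichever line follows it after the shuffle.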

ephemient
  • Reiterating that sort -R doesn't exist on OS X, but +1 for some great Perl answers, and a great answer in general. – Chris Lutz May 20 '09 at 16:40
  • You could install GNU coreutils on OS X, but (as I've done in the past) you have to be careful not to break the built-in tools... That being said, OP is on Redhat Linux, which definitely has GNU coreutils standard. – ephemient May 20 '09 at 16:49
3

When I install coreutils with homebrew

brew install coreutils

shuf becomes available as gshuf.

John McDonnell
1

Mac OS X with DarwinPorts:

sudo port install unsort
cat $file | unsort | ...
Coroos
1

FreeBSD has its own random utility:

cat $file | random | ...

It's in /usr/games/random, so if you have not installed games, you are out of luck.

You could consider installing ports like textproc/rand or textproc/msort. These might well be available on Linux and/or Mac OS X, if portability is a concern.

Coroos
-1

On OSX, grabbing latest from http://ftp.gnu.org/gnu/coreutils/ and something like

./configure && make && sudo make install

...should give you /usr/local/bin/sort --random-sort

without messing up /usr/bin/sort

Dan Brickley
-1

Or get it from MacPorts:

$ sudo port install coreutils

and/or

$ /opt/local/libexec/gnubin/sort --random-sort
NullUserException