337

I want to shuffle the lines of a text file randomly and create a new file. The file may have several thousand lines.

How can I do that with cat, awk, cut, etc?

Joaquin
  • 2,013
  • 3
  • 14
  • 26
Ruggiero Spearman
  • 6,735
  • 5
  • 26
  • 37

19 Answers

416

You can use shuf. On some systems at least (doesn't appear to be in POSIX).
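For example (a sketch; `myfile` is just a stand-in for your own input, generated here with `seq` for illustration):

```shell
seq 10 > myfile                  # sample input: 10 numbered lines
shuf myfile > myfile.shuffled    # write a randomly ordered copy to a new file
```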

As jleedev pointed out: sort -R might also be an option. On some systems at least; well, you get the picture. It has been pointed out that sort -R doesn't really shuffle but instead sorts items according to their hash value.

[Editor's note: sort -R almost shuffles, except that duplicate lines / sort keys always end up next to each other. In other words: only with unique input lines / keys is it a true shuffle. While it's true that the output order is determined by hash values, the randomness comes from choosing a random hash function - see manual.]
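The near-shuffle behavior described above is easy to observe (a sketch; requires GNU `sort` and `shuf`):

```shell
# With duplicated input lines, sort -R always emits the duplicates
# next to each other, while shuf interleaves them freely.
printf '%s\n' a b a b a b | sort -R
printf '%s\n' a b a b a b | shuf
```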

Community
  • 1
  • 1
Joey
  • 344,408
  • 85
  • 689
  • 683
  • Cool, I hadn't known about shuf. It looks like it's part of coreutils but the version I have installed on my server doesn't have shuf. – Ruggiero Spearman Jan 28 '10 at 10:55
  • Yeah, seems to be GNU-only stuff, unfortunately. Nothing of it in POSIX. – Joey Jan 28 '10 at 10:58
  • 2
    It's odd that GNU coreutils has both `shuf` and `sort -R`. – Josh Lee Jan 28 '10 at 13:11
  • Note that shuf is probably a more proper version of what you're trying to do. Read this article on why shuffling naively can be a bad idea: http://www.codinghorror.com/blog/2007/12/the-danger-of-naivete.html – conradlee Jul 11 '11 at 23:01
  • 1
    conradlee: Both `sort -R` and `shuf` should do a reasonably good shuffle. If not, well, then the authors were morons. Nowhere did I say anything about shuffling naïvely. – Joey Jul 12 '11 at 00:34
  • 39
    `shuf` and `sort -R` differ slightly, because `sort -R` randomly orders the elements according to a **hash** of them; that is, `sort -R` will put repeated elements together, while `shuf` shuffles all the elements randomly. – semekh Aug 28 '12 at 14:32
  • 2
    +1 for `shuf` -- It ran much faster in my case (shuffling 60gb file took maybe 20 minutes with `shuf` vs. `sort -R` was running for 1.5+ hours before I killed it) – Dolan Antenucci Sep 16 '12 at 14:12
  • The `rl` command is also available from the `randomize-lines` package, at least on Ubuntu. – Richard Sep 19 '12 at 07:23
  • 1
    @Richard: The author recommends using `shuf`. – Joey Sep 19 '12 at 12:29
  • 151
    For OS X users: `brew install coreutils`, then use `gshuf ...` (: – ELLIOTTCABLE Jan 24 '13 at 15:53
  • 15
    `sort -R` and `shuf` should be seen as completely different. `sort -R` is deterministic. If you call it twice at different times on the same input you will get the same answer. `shuf`, on the other hand, produces randomized output, so it will most likely give different output on the same input. – EfForEffort Feb 06 '13 at 15:41
  • 23
    That is not correct. "sort -R" uses a *different* random hash key each time you invoke it, so it produces different output each time. – Mark Pettit May 16 '14 at 21:30
  • 1
    FreeBSD has `sysutils/coreutils` in the Ports/pkg. Use `gshuf`. FYI. – jj1bdx Dec 16 '14 at 02:44
  • 1
    To clarify the `sort -R` issue: the results are randomly ordered _in general_, _except_ for _duplicate lines_ (keys), which always end up next to each other. The reason is that sorting is based on _hashes_ of the input lines (keys) generated by a _randomly chosen hash function_. – mklement0 May 02 '15 at 17:46
  • 4
    Note on randomness: per the GNU docs, "By default these commands use an internal pseudo-random generator initialized by a small amount of entropy, but can be directed to use an external source with the --random-source=file option." – Royce Williams Nov 26 '15 at 06:11
  • 2
    I assume `sort -R` would be slower than `shuf`, since sorting is an `O(n log n)` operation and shuffling is `O(n)`? – Thomas Ahle Feb 08 '18 at 12:45
  • @ThomasAhle That would be assuming both programs are optimal (or equally suboptimal)… but `shuf` is indeed about 10x faster than `sort -R` for 100,000 short lines in my tests. (As an aside, `shuf -n ` does not offer noticeable performance improvements over `shuf | head -n `, suggesting it is not handled optimally.) – Arne Vogel Jul 23 '18 at 12:44
93

A Perl one-liner would be a simple version of Maxim's solution:

perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < myfile
Jens
  • 69,818
  • 15
  • 125
  • 179
Moonyoung Kang
  • 931
  • 6
  • 2
  • 7
    I aliased this to shuffle on OS X. Thanks! – The Unfun Cat Feb 22 '14 at 17:38
  • This was the only script on this page that returned REAL random lines. Other awk solutions often printed duplicate output. – Felipe Alvarez May 13 '14 at 06:38
  • 1
    But be careful, because in the output you can lose one line :) It will just be joined with another line :) – JavaRunner Nov 03 '14 at 17:38
  • @JavaRunner: I assume you're talking about input without a trailing `\n`; yes, that `\n` must be present - and it typically _is_ - otherwise you'll get what you describe. – mklement0 May 02 '15 at 18:15
  • 1
    Wonderfully concise. I suggest replacing `<STDIN>` with `<>`, so the solution works with input from _files_ too. – mklement0 May 02 '15 at 18:16
  • The other answers suggest utilities that you may or may not already have on your system. Everyone has perl, though (and if you don't, then something you need will require it at some point). – Mars Sep 19 '17 at 01:20
68

This answer complements the many great existing answers in the following ways:

  • The existing answers are packaged into flexible shell functions:

    • The functions take not only stdin input, but alternatively also filename arguments
    • The functions take extra steps to handle SIGPIPE in the usual way (quiet termination with exit code 141), as opposed to breaking noisily. This is important when piping the function output to a pipe that is closed early, such as when piping to head.
  • A performance comparison is made.


  • POSIX-compliant function based on awk, sort, and cut, adapted from the OP's own answer:

shuf() { awk 'BEGIN {srand(); OFMT="%.17f"} {print rand(), $0}' "$@" |
               sort -k1,1n | cut -d ' ' -f2-; }

  • Perl-based function:

shuf() { perl -MList::Util=shuffle -e 'print shuffle(<>);' "$@"; }

  • Python-based function:

shuf() { python -c '
import sys, random, fileinput; from signal import signal, SIGPIPE, SIG_DFL;
signal(SIGPIPE, SIG_DFL); lines=[line for line in fileinput.input()];
random.shuffle(lines); sys.stdout.write("".join(lines))
' "$@"; }

See the bottom section for a Windows version of this function.

  • Ruby-based function:

shuf() { ruby -e 'Signal.trap("SIGPIPE", "SYSTEM_DEFAULT");
                     puts ARGF.readlines.shuffle' "$@"; }

Performance comparison:

Note: These numbers were obtained on a late-2012 iMac with a 3.2 GHz Intel Core i5 and a Fusion Drive, running OSX 10.10.3. While timings will vary with the OS, machine specs, and awk implementation used (e.g., the BSD awk version used on OSX is usually slower than GNU awk and especially mawk), this should provide a general sense of relative performance.

Input file is a 1-million-lines file produced with seq -f 'line %.0f' 1000000.
Times are listed in ascending order (fastest first):

  • shuf
    • 0.090s
  • Ruby 2.0.0
    • 0.289s
  • Perl 5.18.2
    • 0.589s
  • Python
    • 1.342s with Python 2.7.6; 2.407s(!) with Python 3.4.2
  • awk + sort + cut
    • 3.003s with BSD awk; 2.388s with GNU awk (4.1.1); 1.811s with mawk (1.3.4)

For further comparison, the solutions not packaged as functions above:

  • sort -R (not a true shuffle if there are duplicate input lines)
    • 10.661s - allocating more memory doesn't seem to make a difference
  • Scala
    • 24.229s
  • bash loops + sort
    • 32.593s

Conclusions:

  • Use shuf, if you can - it's the fastest by far.
  • Ruby does well, followed by Perl.
  • Python is noticeably slower than Ruby and Perl, and, comparing Python versions, 2.7.6 is quite a bit faster than 3.4.2.
  • Use the POSIX-compliant awk + sort + cut combo as a last resort; which awk implementation you use matters (mawk is faster than GNU awk, BSD awk is slowest).
  • Stay away from sort -R, bash loops, and Scala.
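The timings above can be reproduced roughly as follows (a sketch; absolute numbers depend on hardware and the implementations installed):

```shell
# Generate the 1-million-line input file used for the comparison.
seq -f 'line %.0f' 1000000 > /tmp/lines.txt

# Time one of the candidates; repeat for each function under test.
time shuf /tmp/lines.txt > /dev/null
```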

Windows versions of the Python solution (the Python code is identical, except for variations in quoting and the removal of the signal-related statements, which aren't supported on Windows):

  • For PowerShell (in Windows PowerShell, you'll have to adjust $OutputEncoding if you want to send non-ASCII characters via the pipeline):
# Call as `shuf someFile.txt` or `Get-Content someFile.txt | shuf`
function shuf {
  $Input | python -c @'
import sys, random, fileinput;
lines=[line for line in fileinput.input()];
random.shuffle(lines); sys.stdout.write(''.join(lines))
'@ $args  
}

Note that PowerShell can natively shuffle via its Get-Random cmdlet (though performance may be a problem); e.g.:
Get-Content someFile.txt | Get-Random -Count ([int]::MaxValue)

  • For cmd.exe (a batch file):

Save to file shuf.cmd, for instance:

@echo off
python -c "import sys, random, fileinput; lines=[line for line in fileinput.input()]; random.shuffle(lines); sys.stdout.write(''.join(lines))" %*
mklement0
  • 382,024
  • 64
  • 607
  • 775
  • SIGPIPE doesn't exist on Windows so I used this simple one-liner instead: `python -c "import sys, random; lines = [x for x in sys.stdin.read().splitlines()] ; random.shuffle(lines); print(\"\n\".join([line for line in lines]));"` – elig Sep 28 '19 at 20:08
  • @elig: Thanks, but omitting `from signal import signal, SIGPIPE, SIG_DFL; signal(SIGPIPE, SIG_DFL);` from the original solution is sufficient, and retains the flexibility of also being able to pass filename _arguments_ - no need to change anything else (except for quoting) - please see the new section I've added at the bottom. – mklement0 Sep 29 '19 at 04:26
27

I use a tiny perl script, which I call "unsort":

#!/usr/bin/perl
use List::Util 'shuffle';
@list = <STDIN>;
print shuffle(@list);

I've also got a NULL-delimited version, called "unsort0" ... handy for use with find -print0 and so on.

PS: Voted up 'shuf' too; I had no idea that was there in coreutils these days ... the above may still be useful if your system doesn't have 'shuf'.

NickZoic
  • 7,575
  • 3
  • 25
  • 18
23

Here is a first try that's easy on the coder but hard on the CPU: it prepends a random number to each line, sorts them, and then strips the random number from each line. In effect, the lines are sorted randomly:

cat myfile | awk 'BEGIN{srand();}{print rand()"\t"$0}' | sort -k1 -n | cut -f2- > myfile.shuffled
Ruggiero Spearman
  • 6,735
  • 5
  • 26
  • 37
  • 9
    UUOC. pass the file to awk itself. – ghostdog74 Jan 28 '10 at 11:30
  • 1
    Right, I debug with `head myfile | awk ...`. Then I just change it to cat; that's why it was left there. – Ruggiero Spearman Jan 28 '10 at 13:00
  • Don't need `-k1 -n` for sort, since the output of awk's `rand()` is a decimal between 0 and 1 and because all that matters is that it gets reordered somehow. `-k1` might help speed it up by ignoring the rest of the line, though the output of rand() should be unique enough to short-circuit the comparison. – bonsaiviking Mar 19 '14 at 14:00
  • @ghostdog74: Most so called useless uses of cat are actually useful for being consistent between piped commands and not. Better to keep the `cat filename |` (or `< filename |`) than remember how each single program takes file input (or not). – ShreevatsaR Aug 26 '14 at 18:25
  • 2
    shuf() { awk 'BEGIN{srand()}{print rand()"\t"$0}' "$@" | sort | cut -f2- ;} – Meow Jan 22 '15 at 03:27
  • @bonsaiviking: `-k1` is redundant, because it still sorts the line _as a whole_, because no _stop_ field is specified; to truly limit sorting to the first field, `-k1,1` is needed. However, using `-n` speeds up sorting noticeably. Thus, you should use `-k1,1 -n` (to be explicit) or, taking advantage of the fact that the sort key is the first field and that `sort` uses longest-prefix matching when detecting numbers, just `-n`. – mklement0 May 05 '15 at 22:38
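The `-k1` vs `-k1,1` distinction can be checked directly (a sketch; `-s` disables sort's whole-line last-resort comparison so the key's extent becomes visible):

```shell
# With -k1 the sort key runs from field 1 to the end of the line,
# so the second field decides between equal first fields:
printf '1 b\n1 a\n' | sort -k1        # "1 a" comes first

# With -s -k1,1 only field 1 is compared, and ties keep input order:
printf '1 b\n1 a\n' | sort -s -k1,1   # "1 b" comes first
```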
16

Here's an awk script:

awk 'BEGIN{srand() }
{ lines[++d]=$0 }
END{
    while (1){
    if (e==d) {break}
        RANDOM = int(1 + rand() * d)
        if ( RANDOM in lines  ){
            print lines[RANDOM]
            delete lines[RANDOM]
            ++e
        }
    }
}' file

output

$ cat file
1
2
3
4
5
6
7
8
9
10

$ ./shell.sh
7
5
10
9
6
8
2
1
3
4
ghostdog74
  • 327,991
  • 56
  • 259
  • 343
  • Nicely done, but in practice much slower than [the OP's own answer](http://stackoverflow.com/a/2153889/45375), which combines `awk` with `sort` and `cut`. For no more than several thousand lines it doesn't make much of a difference, but with higher line counts it matters (the threshold depends on the `awk` implementation used). A slight simplification would be to replace lines `while (1){` and `if (e==d) {break}` with `while (e<d){`. – mklement0 May 05 '15 at 22:01
11

A one-liner for Python:

python -c "import random, sys; lines = open(sys.argv[1]).readlines(); random.shuffle(lines); print ''.join(lines)," myFile

And for printing just a single random line:

python -c "import random, sys; print random.choice(open(sys.argv[1]).readlines())," myFile

But see this post for the drawbacks of Python's random.shuffle(). It cannot produce every possible ordering when there are many (more than 2080) elements.
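If that matters for your input size, one workaround is an OS-entropy-backed generator (a sketch; `random.SystemRandom` draws from the system entropy source rather than the seeded Mersenne Twister, and `python3` is assumed here):

```shell
# Shuffle stdin using an OS-entropy-backed generator:
seq 5 | python3 -c "import sys, random; lines = sys.stdin.readlines(); random.SystemRandom().shuffle(lines); sys.stdout.write(''.join(lines))"
```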

Community
  • 1
  • 1
scai
  • 20,297
  • 4
  • 56
  • 72
  • 2
    the "drawback" is not specific to Python. Finite PRNG periods can be worked around by reseeding the PRNG with entropy from the system, as `/dev/urandom` does. To utilize it from Python: `random.SystemRandom().shuffle(L)`. – jfs Sep 24 '14 at 19:26
  • Doesn't the join() need to be on '\n' so the lines get printed each on its own line? – elig Sep 28 '19 at 20:09
  • @elig: No, because `.readlines()` returns the lines _with_ a trailing newline. – mklement0 Sep 29 '19 at 04:36
9

Simple awk-based function will do the job:

shuffle() { 
    awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' | sort -n | cut -c8-
}

usage:

any_command | shuffle

This should work on almost any UNIX. Tested on Linux, Solaris and HP-UX.

Update:

Note that the leading zeros (%06d) and the rand() multiplication make it work properly also on systems where sort does not understand numbers; the keys can then be sorted in lexicographical order (i.e., by normal string comparison).
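The point about lexicographical sorting can be illustrated on its own (a sketch):

```shell
# Without zero-padding, string comparison misorders numbers: "10" < "9".
printf '%s\n' 9 10 | sort     # "10" sorts before "9"

# Fixed-width, zero-padded keys sort correctly even as plain strings:
printf '%s\n' 09 10 | sort    # "09" sorts before "10"
```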

Michał Šrajer
  • 30,364
  • 7
  • 62
  • 85
  • Good idea to package the OP's own answer as a function; if you append `"$@"`, it'll also work with _files_ as input. There is no reason to multiply `rand()`, because `sort -n` is capable of sorting decimal fractions. It is, however, a good idea to control `awk`'s output format, because with the default format, `%.6g`, `rand()` will output the occasional number in _exponential_ notation. While shuffling up to 1 million lines is arguably enough in practice, it's easy to support more lines without paying much of a performance penalty; e.g. `%.17f`. – mklement0 May 07 '15 at 11:30
  • 1
    @mklement0 I didn't notice OPs answer while writing mine. rand() is multiplied by 10e6 to make it work with solaris or hpux sort as far as I remember. Good idea with "$@" – Michał Šrajer May 08 '15 at 12:46
  • 1
    Got it, thanks; perhaps you could add this rationale for the multiplication to the answer itself; generally, according to POSIX, [`sort` should be able to handle decimal fractions](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html) (even with thousands separators, as I've just noticed). – mklement0 May 08 '15 at 12:53
8

Ruby FTW:

ls | ruby -e 'puts STDIN.readlines.shuffle'
hoffmanc
  • 614
  • 9
  • 16
  • 1
    Great stuff; If you use `puts ARGF.readlines.shuffle`, you can make it work with both stdin input and filename arguments. – mklement0 May 08 '15 at 14:34
  • 1
    Even shorter `ruby -e 'puts $<.sort_by{rand}'` — ARGF is already an enumerable, so we can shuffle the lines by sorting it by random values. – akuhn Jul 24 '15 at 03:08
8

A simple and intuitive way would be to use shuf.

Example:

Assume words.txt as:

the
an
linux
ubuntu
life
good
breeze

To shuffle the lines, do:

$ shuf words.txt

which writes the shuffled lines to standard output; so you have to redirect it to an output file like:

$ shuf words.txt > shuffled_words.txt

One such shuffle run could yield:

breeze
the
linux
an
ubuntu
good
life
kmario23
  • 57,311
  • 13
  • 161
  • 150
6

A one-liner for Python based on scai's answer, but a) it takes stdin, b) it makes the result repeatable with a seed, and c) it picks out only 200 of all lines.

$ cat file | python -c "import random, sys; 
  random.seed(100); print ''.join(random.sample(sys.stdin.readlines(), 200))," \
  > 200lines.txt
Community
  • 1
  • 1
dfrankow
  • 20,191
  • 41
  • 152
  • 214
5

There is a package that does exactly this job:

sudo apt-get install randomize-lines

Example:

Create an ordered list of numbers, and save it to 1000.txt:

seq 1000 > 1000.txt

To shuffle it, simply use:

rl 1000.txt
Tunaki
  • 132,869
  • 46
  • 340
  • 423
btwiuse
  • 2,585
  • 1
  • 23
  • 31
4

If, like me, you came here looking for an alternative to shuf for macOS, then use randomize-lines.

Install the randomize-lines (Homebrew) package, which provides an rl command with functionality similar to shuf.

brew install randomize-lines

Usage: rl [OPTION]... [FILE]...
Randomize the lines of a file (or stdin).

  -c, --count=N  select N lines from the file
  -r, --reselect lines may be selected multiple times
  -o, --output=FILE
                 send output to file
  -d, --delimiter=DELIM
                 specify line delimiter (one character)
  -0, --null     set line delimiter to null character
                 (useful with find -print0)
  -n, --line-number
                 print line number with output lines
  -q, --quiet, --silent
                 do not output any errors or warnings
  -h, --help     display this help and exit
  -V, --version  output version information and exit
Ahmad Awais
  • 33,440
  • 5
  • 74
  • 56
3

This is a Python script that I saved as rand.py in my home folder:

#!/usr/bin/env python

import sys
import random

if __name__ == '__main__':
  with open(sys.argv[1], 'r') as f:
    flist = f.readlines()
    random.shuffle(flist)

    for line in flist:
      print line.rstrip('\n')

On Mac OS X, sort -R and shuf are not available, so you can alias this in your bash_profile as:

alias shuf='python rand.py'
Jeff Wu
  • 2,428
  • 1
  • 21
  • 25
2

If you have Scala installed, here's a one-liner to shuffle the input:

ls -1 | scala -e 'for (l <- util.Random.shuffle(io.Source.stdin.getLines.toList)) println(l)'
swartzrock
  • 729
  • 5
  • 6
  • Alluringly simple, but unless the Java VM must be started up anyway, that startup cost is considerable; doesn't perform well with large line counts either. – mklement0 May 08 '15 at 21:42
1

This bash function has minimal dependencies (only sort and bash):

shuf() {
while IFS= read -r x;do
    echo "$RANDOM"$'\x1f'"$x"
done | sort |
while IFS=$'\x1f' read -r x y;do
    echo "$y"
done
}
Meow
  • 4,341
  • 1
  • 18
  • 17
  • Nice bash solution that parallels the OP's own `awk`-assisted solution, but performance will be a problem with larger input; your use of a single `$RANDOM` value shuffles correctly only up to 32,768 input lines; while you could extend that range, it's probably not worth it: for instance, on my machine, running your script on 32,768 short input lines takes about 1 second, which is about 150 times as long as running `shuf` takes, and about 10-15 times as long as the OP's own `awk`-assisted solution takes. If you can rely on `sort` being present, `awk` should be there as well. – mklement0 May 08 '15 at 20:25
0

In Windows, you may try this batch file to help you shuffle your data.txt. The usage of the batch code is:

C:\> type list.txt | shuffle.bat > maclist_temp.txt

After issuing this command, maclist_temp.txt will contain a randomized list of lines.

Hope this helps.

Ayfan
  • 29
  • 6
0

Not mentioned so far:

  1. The unsort util. Syntax (somewhat playlist oriented):

    unsort [-hvrpncmMsz0l] [--help] [--version] [--random] [--heuristic]
           [--identity] [--filenames[=profile]] [--separator sep] [--concatenate] 
           [--merge] [--merge-random] [--seed integer] [--zero-terminated] [--null] 
           [--linefeed] [file ...]
    
  2. msort can shuffle by line, but it's usually overkill:

    seq 10 | msort -jq -b -l -n 1 -c r
    
agc
  • 7,973
  • 2
  • 29
  • 50
0

Another awk variant:

#!/usr/bin/awk -f
# usage:
# awk -f randomize_lines.awk lines.txt
# usage after "chmod +x randomize_lines.awk":
# randomize_lines.awk lines.txt

BEGIN {
  FS = "\n";
  srand();
}

{
  # append NR to the key so two identical rand() values can't overwrite each other
  lines[ rand() "-" NR ] = $0;
}

END {
  for( k in lines ){
    print lines[k];
  }
}
biziclop
  • 14,466
  • 3
  • 49
  • 65