What's an easy way to read random line from a file in a shell script?
-
Is each line padded to a fixed length? – Tracker1 Jan 15 '09 at 19:03
-
no, each line has a variable number of characters – Jan 15 '09 at 19:04
-
large file: http://stackoverflow.com/questions/29102589/get-random-lines-from-large-files-in-bash – Ciro Santilli OurBigBook.com Nov 20 '15 at 09:58
13 Answers
You can use `shuf`:

shuf -n 1 $FILE

There is also a utility called `rl`. In Debian it's in the `randomize-lines` package, which does exactly what you want, though it's not available in all distros. On its home page it actually recommends the use of `shuf` instead (which didn't exist when it was created, I believe). `shuf` is part of the GNU coreutils; `rl` is not.

rl -c 1 $FILE

-
Does this `rl` have any advantages? `shuf` seems to work perfectly! – Thomas Ahle Jun 10 '11 at 15:46
-
shuf is great as a drop-in replacement for head command, good to know – Tomasz Tybulewicz Jun 10 '13 at 07:45
-
And also, `sort -R` is definitely going to make one wait **a lot** when dealing with considerably huge files -- 80kk lines -- whereas `shuf -n` acts quite instantaneously. – Rubens Jun 18 '13 at 06:56
-
You can get shuf on OS X by installing `coreutils` from Homebrew. Might be called `gshuf` instead of `shuf`. – Alyssa Ross Dec 27 '13 at 22:27
-
Similarly, you can use `randomize-lines` on OS X by `brew install randomize-lines; rl -c 1 $FILE` – Jamie Apr 09 '14 at 18:03
-
@Rubens: [the same question](http://stackoverflow.com/questions/9245638/select-random-lines-from-a-file-in-bash/9245733#comment40761869_9245733) – jfs Sep 24 '14 at 18:50
-
@J.F.Sebastian: [the same answer](http://stackoverflow.com/questions/9245638/select-random-lines-from-a-file-in-bash/9245733#comment40766527_9245733) – Rubens Sep 24 '14 at 21:13
-
@ThomasAhle, the Debian package summary for `rl` (the randomize-lines package) states *Users are recommended to use the shuf command instead which should be available by default. This package may be considered deprecated.* Therefore, `shuf` appears preferable. – Adam Katz Dec 17 '14 at 21:50
-
Note that `shuf` is part of [GNU Coreutils](https://en.wikipedia.org/wiki/GNU_Core_Utilities) and therefore won't necessarily be available (by default) on *BSD systems (or Mac?). @Tracker1's perl one-liner below is more portable (and by my tests, is slightly faster). – Adam Katz Dec 19 '14 at 21:49
-
This is a cool command! Yet another wheel I've reinvented not knowing it already exists in my flavor of Unix! Thank you! – Sol Jul 08 '16 at 14:23
-
though this is not suitable for huge files... I'm getting a 'shuf: read error: Cannot allocate memory' on a 70GB file – jimijazz Oct 07 '16 at 00:30
-
This is a great answer. I would just like to point out that in case more than 1 line is needed, ``shuf`` and ``rl`` make *permutations* of lines, not random draws. I.e. if you want to draw k random lines, you will want to run ``shuf -n 1`` k times. This will draw from N^k possibilities instead of N!/(N-k)! possibilities, where N is the total number of lines. E.g., get 7 random lines from wordlist.txt: ``for n in {1..7}; do shuf -n1 wordlist.txt; done`` – sujeet Mar 09 '17 at 04:19
-
you can use process substitution if you don't want to give `shuf` a file: `shuf -n 1 <(echo -e "heads\ntails")` will randomly pick "heads" or "tails". Or just pipe to it: `echo -e "heads\ntails" | shuf -n 1` – pmarreck Oct 10 '22 at 18:23
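An editorial aside: if you need the pick to be repeatable (say, in tests), GNU `shuf` also accepts `--random-source=FILE`; feeding it a fixed byte stream makes the selection deterministic. A minimal sketch, assuming GNU coreutils and bash process substitution:

# Same source bytes => same line every run (sketch, not from the answer above).
shuf -n 1 --random-source=<(yes 42) "$FILE"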
sort --random-sort $FILE | head -n 1
(I like the shuf approach above even better though - I didn't even know that existed and I would have never found that tool on my own)

-
+1 I like it, but you may need a very recent `sort`, didn't work on any of my systems (CentOS 5.5, Mac OS 10.7.2). Also, useless use of cat, could be reduced to `sort --random-sort < $FILE | head -n 1` – Steve Kehlet Feb 16 '12 at 19:02
-
`sort -R <<< $'1\n1\n2' | head -1` is as likely to return 1 as 2, because `sort -R` sorts duplicate lines together. The same applies to `sort -Ru`, because it removes duplicate lines. – Lri Sep 15 '12 at 11:03
-
This is relatively slow, since the whole file needs to get shuffled by `sort` before piping it to `head`. `shuf` selects random lines from the file instead, and is much faster for me. – Bengt Nov 25 '12 at 17:33
-
@SteveKehlet while we're at it, `sort --random-sort $FILE | head` would be best, as it allows it to access the file directly, possibly enabling efficient parallel sorting – WaelJ Jun 06 '14 at 18:22
-
The `--random-sort` and `-R` options are specific to GNU sort (so they won't work with BSD or Mac OS `sort`). GNU sort learned those flags in 2005 so you need GNU coreutils 6.0 or newer (eg CentOS 6). – RJHunter Apr 09 '15 at 07:09
-
from Wikipedia: "this is not a full random shuffle because it will sort identical lines together" – janosdivenyi Apr 14 '15 at 10:58
-
@Bengt: nothing is written until `shuf` reads the whole file into memory. `sort` may work even if the file does not fit in memory. – jfs Sep 26 '15 at 00:59
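A hedged variation (my sketch, not from the answer): the duplicate-grouping behaviour of `sort -R` noted above can be avoided by decorating each line with its own random key, sorting on the key, and stripping it afterwards:

# Decorate-sort-undecorate: every line gets an independent random key,
# so identical lines are no longer forced to sort together.
awk 'BEGIN { srand() } { printf "%.8f\t%s\n", rand(), $0 }' "$FILE" | sort -n | head -n 1 | cut -f2-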
Another alternative:
head -$((${RANDOM} % `wc -l < file` + 1)) file | tail -1

-
${RANDOM} only generates numbers less than 32768, so don't use this for large files (for example the English dictionary). – Ralf Mar 13 '12 at 20:16
-
This does not give you precisely the same probability for every line, due to the modulo operation. This barely matters if the file length is << 32768 (and not at all if it divides that number), but it may be worth noting. – Anaphory Mar 21 '14 at 17:58
-
You can extend this to 30-bit random numbers by using `(${RANDOM} << 15) + ${RANDOM}`. This significantly reduces the bias and allows it to work for files containing up to 1 billion lines. – nneonneo Jun 19 '15 at 05:42
-
@nneonneo: Very cool trick, though according to this link it should be OR'ing the ${RANDOM}'s instead of PLUS'ing http://stackoverflow.com/a/19602060/293064 – Jay Taylor Jul 12 '15 at 01:54
-
`+` and `|` are the same since `${RANDOM}` is 0..32767 by definition. – nneonneo Jul 12 '15 at 07:12
-
There's a heavy performance penalty to this, since it needs to count lines to be sure it's reading to the right point. – Charles Duffy Mar 19 '18 at 22:35
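Combining @nneonneo's 30-bit trick with this answer gives a version that handles files beyond 32767 lines (a sketch under that assumption, still subject to the small modulo bias @Anaphory describes):

# 30-bit random value: RANDOM is 15 bits, so shift-and-OR two draws together.
N=$(wc -l < file)
head -n $(( ((RANDOM << 15) | RANDOM) % N + 1 )) file | tail -1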
This is simple.
cat file.txt | shuf -n 1
Granted, this is just a tad slower than `shuf -n 1 file.txt` on its own.

-
Best answer. I didn't know about this command. Note that `-n 1` specifies 1 line, and you can change it to more than 1. `shuf` can be used for other things too; I just piped `ps aux` and `grep` with it to randomly kill processes partially matching a name. – sudo Jan 18 '17 at 22:53
perlfaq5: How do I select a random line from a file? Here's a reservoir-sampling algorithm from the Camel Book:
perl -e 'srand; rand($.) < 1 && ($line = $_) while <>; print $line;' file
This has a significant advantage in space over reading the whole file in. You can find a proof of this method in The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.

-
Just for the purposes of inclusion (in case the referred site goes down), here's the code that Tracker1 pointed to: "cat filename | perl -e 'while (<>) { push(@_,$_); } print @_[rand()*@_];';" – Anirvan Jan 15 '09 at 19:16
-
This is a useless use of cat. Here's a slight modification of the code found in perlfaq5 (and courtesy of the Camel book): perl -e 'srand; rand($.) < 1 && ($line = $_) while <>; print $line;' filename – Mr. Muskrat Jan 15 '09 at 21:55
-
I just benchmarked an N-lines version of this code against `shuf`. The perl code is very slightly faster (8% faster by user time, 24% faster by system time), though anecdotally I've found the perl code "seems" less random (I wrote a jukebox using it). – Adam Katz Dec 17 '14 at 21:59
-
More food for thought: [`shuf` stores the whole input file in memory](https://stackoverflow.com/questions/9245638/select-random-lines-from-a-file-in-bash/9245733#comment40771587_9245733), which is a horrible idea, while this code only stores one line, so the limit of this code is a line count of INT_MAX (2^31 or 2^63 depending on your arch), assuming any of its selected potential lines fits in memory. – Adam Katz Dec 19 '14 at 21:58
-
here's the awk equivalent. either of these answers (perl or awk) are better than the accepted for - portability, speed, and ability to manage huge files easily. `awk 'BEGIN{srand()}{rand()*NR<1&&l=$0}END{print l}' file` or `some_input | awk 'BEGIN{srand()}{rand()*NR<1&&l=$0}END{print l}'` – keithpjolley Apr 19 '20 at 17:18
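For completeness, the same reservoir-sampling idea can be written in pure bash (a sketch; bash's `$RANDOM` is only 15 bits, so the draw becomes biased once the line count approaches 32768):

# Keep line n with probability 1/n; the survivor is uniform over all lines.
selected=""
n=0
while IFS= read -r line; do
  n=$((n + 1))
  if (( RANDOM % n == 0 )); then
    selected=$line
  fi
done < "$FILE"
printf '%s\n' "$selected"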
using a bash script:
#!/bin/bash
# replace with the file to read
FILE=tmp.txt
# count the number of lines
NUM=$(wc -l < ${FILE})
# generate a random number in range 1-NUM
X=$(( ${RANDOM} % ${NUM} + 1 ))
# extract the X-th line
sed -n ${X}p ${FILE}

-
Random can be 0, sed needs 1 for the first line. sed -n 0p returns error. – asalamon74 Jan 15 '09 at 19:20
-
but even with the bug it's worth a point, as it does not need perl or python and is as efficient as you can get (it reads the file exactly twice, but not into memory, so it would work even with huge files). – blabla999 Jan 15 '09 at 19:28
-
@asalamon74: thanks. @blabla999: if we make a function out of it, ok for $1, but why not compute NUM? – Paolo Tedesco Jan 15 '09 at 19:28
-
@Hasturkun: beware - the output of wc depends on whether it reads stdin or a file name off its command line. Granted, 'wc -l < $FILE' would be OK; using 'wc -l $FILE' (no redirection) would be a bug. – Jonathan Leffler Jan 16 '09 at 08:06
-
@Hasturkun & J.Leffler: the cat was meant to avoid wc printing the file name. Fixed with the 'wc -l < $FILE' suggestion, thanks – Paolo Tedesco Jan 16 '09 at 08:26
-
The variable names should be quoted, especially `$FILE`. The curly braces are superfluous here. I recommend using lowercase or mixed-case variable names to avoid potential name collisions with shell or environment variables. – Dennis Williamson Oct 28 '11 at 14:22
-
If a file has 32769 or more lines, the last ones are never selected. `wc - l` shouldn't have a space. – Lri Sep 15 '12 at 11:12
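Incorporating Dennis Williamson's quoting and naming advice, a cleaned-up sketch of the same script might read:

#!/bin/bash
# file to read: first argument, falling back to tmp.txt
file=${1:-tmp.txt}
# count the number of lines
num=$(wc -l < "$file")
# random line number in 1..num (biased, and capped, for files over 32767 lines)
x=$(( RANDOM % num + 1 ))
# extract the x-th line
sed -n "${x}p" "$file"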
Single bash line:
sed -n $((1+$RANDOM%`wc -l test.txt | cut -f 1 -d ' '`))p test.txt
Slight problem: duplicate filename.

-
slighter problem. performing this on /usr/share/dict/words tends to favor words starting with "A". Playing with it, I'm at about 90% "A" words to 10% "B" words. None starting with numbers yet, which make up the head of the file. – bibby Sep 30 '10 at 05:01
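The skew @bibby observed has a likely cause: `$RANDOM` never exceeds 32767, so on a dictionary of 235886 lines the modulo expression can only ever select from the first 32768 lines (the "A" words and early "B" words). Widening the random value, per @nneonneo's comment above, is one fix (a sketch):

# 30 bits of randomness instead of 15, so every line is reachable.
sed -n "$(( 1 + ((RANDOM << 15) | RANDOM) % $(wc -l < test.txt) ))p" test.txt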
Here's a simple Python script that will do the job:
import random, sys
lines = open(sys.argv[1]).readlines()
print(lines[random.randrange(len(lines))])
Usage:
python randline.py file_to_get_random_line_from

-
This doesn't quite work. It stops after a single line. To make it work, I did this: `import random, sys lines = open(sys.argv[1]).readlines() ` for i in range(len(lines)): rand = random.randint(0, len(lines)-1) print lines.pop(rand), – Jed Daniels Jan 14 '11 at 20:13
-
Stupid comment system with crappy formatting. Didn't formatting in comments work once upon a time? – Jed Daniels Jan 14 '11 at 20:14
-
randint is inclusive therefore `len(lines)` may lead to IndexError. You could use `print(random.choice(list(open(sys.argv[1]))))`. There is also memory efficient [reservoir sampling algorithm](http://askubuntu.com/a/527778/3712). – jfs Sep 24 '14 at 19:08
-
@MichaelCampbell: [reservoir sampling algorithm](http://stackoverflow.com/a/32792504/4279) that I've mentioned above may work with 3TB file (if line size is limited). – jfs Sep 26 '15 at 01:02
-
Using [py](https://github.com/Russell91/pythonpy) is nice. `-l` assigns incoming lines to a list, `l`. `py` auto-imports stdlib modules. so you can do `cat $FILE | py -l "random.choice(l)"`. Try it: `python -m this | py -l "random.choice(l)"` ... erm actually just `py this | py -l "random.choice(l)"` ;) – floer32 Jan 05 '16 at 21:23
Another way using 'awk'
awk NR==$((${RANDOM} % `wc -l < file.name` + 1)) file.name

-
That uses awk and bash (`$RANDOM` is a [bashism](https://en.wikipedia.org/wiki/Bashism)). Here is a pure awk (mawk) method using the same logic as @Tracker1's cited perlfaq5 code above: `awk 'rand() * NR < 1 { line = $0 } END { print line }' file.name` (wow, it's even *shorter* than the perl code!) – Adam Katz Dec 19 '14 at 21:33
-
That code must read the file (`wc`) in order to get a line count, then must read (part of) the file again (`awk`) to get the content of the given random line number. I/O will be far more expensive than getting a random number. My code reads the file once only. The issue with awk's `rand()` is that it seeds based on seconds, so you'll get duplicates if you run it consecutively too fast. – Adam Katz Dec 19 '14 at 21:41
A solution that also works on MacOSX, and should also work on Linux(?):
N=5
awk 'NR==FNR {lineN[$1]; next}(FNR in lineN)' <(jot -r $N 1 $(wc -l < $file)) $file
Where:

- `N` is the number of random lines you want
- `NR==FNR {lineN[$1]; next}(FNR in lineN) file1 file2` --> save the line numbers written in `file1` and then print the corresponding lines in `file2`
- `jot -r $N 1 $(wc -l < $file)` --> draw `N` numbers randomly (`-r`) in range `(1, number_of_lines_in_file)` with `jot`. The process substitution `<()` will make it look like a file for the interpreter, so `file1` in the previous example.
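One caveat worth adding (my note, not the answerer's): `jot -r` draws with replacement and awk's `lineN` array deduplicates, so a run can occasionally print fewer than `N` lines. Example usage against the system dictionary:

# Pick 5 (or occasionally fewer, on duplicate draws) random dictionary words.
file=/usr/share/dict/words
N=5
awk 'NR==FNR {lineN[$1]; next} (FNR in lineN)' <(jot -r "$N" 1 "$(wc -l < "$file")") "$file"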

Using only vanilla sed and awk, and without using $RANDOM, a simple, space-efficient and reasonably fast "one-liner" for selecting a single line pseudo-randomly from a file named FILENAME is as follows:
sed -n $(awk 'END {srand(); r=rand()*NR; if (r<NR) {sub(/\..*/,"",r); r++;}; print r}' FILENAME)p FILENAME
(This works even if FILENAME is empty, in which case no line is emitted.)
One possible advantage of this approach is that it only calls rand() once.
As pointed out by @AdamKatz in the comments, another possibility would be to call rand() for each line:
awk 'rand() * NR < 1 { line = $0 } END { print line }' FILENAME
(A simple proof of correctness can be given based on induction.)
Caveat about rand()
"In most awk implementations, including gawk, rand() starts generating numbers from the same starting number, or seed, each time you run awk."
-- https://www.gnu.org/software/gawk/manual/html_node/Numeric-Functions.html

-
See [the comment I posted a year before this answer](https://stackoverflow.com/questions/448005/whats-an-easy-way-to-read-random-line-from-a-file-in-unix-command-line#comment43573811_18607080), which has a simpler awk solution that doesn't require sed. Also note my caveat about awk's random number generator, which seeds at whole seconds. – Adam Katz Mar 19 '18 at 18:40
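One way around the whole-second seeding described above (a workaround I'm sketching, not part of the original answer) is to pass awk a seed from the shell:

# Seed awk's PRNG explicitly so back-to-back runs differ within one second.
awk -v seed="$RANDOM" 'BEGIN { srand(seed) } rand() * NR < 1 { line = $0 } END { print line }' FILENAME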
#!/bin/bash
# read every line (word) of the file given as $1 into an array
IFS=$'\n' wordsArray=($(<"$1"))
numWords=${#wordsArray[@]}
# number of digits in the word count
sizeOfNumWords=${#numWords}
# build a random number digit by digit until it is a valid 0-based index
while true
do
    ranNumStr=""
    for ((i=0; i<sizeOfNumWords; i++))
    do
        ranNumStr="$ranNumStr$(( RANDOM % 10 ))"
    done
    # strip leading zeros so the number is not parsed as octal
    noLeadZeroStr=$((10#$ranNumStr))
    if [ $noLeadZeroStr -lt $numWords ]
    then
        break
    fi
done
echo "${wordsArray[$noLeadZeroStr]}"

-
Since $RANDOM generates numbers less than the number of words in /usr/share/dict/words, which has 235886 (on my Mac anyway), I just generate 6 separate random numbers between 0 and 9 and string them together. Then I make sure that number is less than 235886. Then remove leading zeros to index the words that I stored in the array. Since each word is its own line this could easily be used for any file to randomly pick a line. – Ken Roy Jun 15 '17 at 13:01
Here is what I discovered, since my Mac OS doesn't come with all the easy answers. I used the jot command to generate a number, since the $RANDOM variable solutions seemed not to be very random in my test. When testing my solution I saw a wide variance in the output.
RANDOM1=`jot -r 1 1 235886`
#range of jot ( 1 235886 ) found from earlier wc -w /usr/share/dict/web2
echo $RANDOM1
head -n $RANDOM1 /usr/share/dict/web2 | tail -n 1
The echo of the variable is to get a visual of the generated random number.
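Generalized to an arbitrary file (a sketch assuming `jot` is available, as it is on macOS):

# jot draws a line number in 1..line_count; sed prints just that line.
file=$1
r=$(jot -r 1 1 "$(wc -l < "$file")")
sed -n "${r}p" "$file"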
