360

In a Bash script, I want to pick out N random lines from an input file and output them to another file.

How can this be done?

codeforester
user121196
  • Sort the file randomly and pick N first lines. – Piotr Praszmo Feb 12 '12 at 01:32
  • Also see http://stackoverflow.com/questions/12354659/how-to-select-random-lines-from-a-file. – Asclepius Oct 12 '12 at 21:18
  • 35
    this is not a duplicate -- he wants N lines vs 1 line. – OneSolitaryNoob May 08 '15 at 19:18
  • 2
    related: [Randomly Pick Lines From a File Without Slurping It With Unix](http://stackoverflow.com/q/692312/4279) – jfs Sep 26 '15 at 01:38
  • 1
    I disagree with `sort -R` as it does a lot of excess work, particularly for long files. You can use `$RANDOM`, `% wc -l`, `jot`, `sed -n` (à la https://stackoverflow.com/a/6022431/563329), and bash functionality (arrays, command redirects, etc) to define your own `peek` function which will actually run on 5,000,000-line files. – isomorphismes Mar 05 '19 at 22:42

8 Answers

822

Use shuf with the -n option as shown below, to get N random lines:

shuf -n N input > output
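
For example (the file name and line count here are just illustrative, assuming GNU coreutils shuf), to write 100 random lines from the system dictionary to a file, or to sample from a pipe:

shuf -n 100 /usr/share/dict/words > sample.txt
grep -i '^a' /usr/share/dict/words | shuf -n 100 > sample.txt

shuf reads standard input when no file is given, so it drops into a pipeline like any other filter.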
DomainsFeatured
dogbane
  • 8
    If you just need a random set of lines, not in a random order, then shuf is very inefficient (for big file): better is to do reservoir sampling, as in [this answer](https://stackoverflow.com/a/692401/933228). – petrelharp Sep 06 '15 at 23:24
  • 6
    I ran this on a 500M row file to extract 1,000 rows and it took 13 min. The file had not been accessed in months, and is on an Amazon EC2 SSD Drive. – T. Brian Jones Mar 04 '16 at 18:26
  • 1
    so is this in essence more random than `sort -R`? – Mona Jalal Mar 09 '17 at 03:06
  • 1
    @MonaJalal nope just faster, since it doesn't have to compare lines at all. – rogerdpack May 15 '17 at 17:20
  • 2
    Does it eventually yield the same line more than once? – Frederick Nord Mar 27 '18 at 20:21
  • 1
    @FrederickNord You might get repeated values when there are repeated lines, but the same line is not yielded twice. – ByIvo Jul 08 '20 at 23:47
  • 1
    `shuf file.txt` outputs randomly the lines to standard output. ----- `shuf -r -n 5 file.txt` - The flag `-r` allows the command to yield repeated lines; Don't forget to set the number of lines you want to be printed using the `-n` flag, otherwise it will output forever. ------ `sort -R` will shuffle, but it will also group identical keys. So if you have repeated lines but you don't want repeated values, pipe it with the `uniq` command. e.g. `sort -R file.txt | uniq` – ByIvo Jul 09 '20 at 00:03
  • But this will also reorder the lines. – Leo Oct 02 '20 at 06:31
  • With regard to the reordering issue, just pipe the result to sort; extracting 1000 lines from 557,344 lines took 0.019s: `wc -l list_n_sorted` → `557344 list_n_sorted`; `time shuf -n 1000 list_n_sorted | sort -n > test_file` → `real 0m0.019s  user 0m0.010s  sys 0m0.012s` – Sruli Dec 07 '20 at 12:50
  • time shuf -n 1000 500M.txt -> 0m21.447s in a very mediocre SSD (127MB/s) – r_31415 Mar 02 '22 at 23:48
196

Sort the file randomly and pick the first 100 lines:

lines=100
input_file=/usr/share/dict/words

# This is the basic selection method
<$input_file sort -R | head -n $lines

# If the file has duplicates that must never cause duplicate results
<$input_file sort | uniq        | sort -R | head -n $lines

# If the file has blank lines that must be filtered, use sed
<$input_file sed $'/^[ \t]*$/d' | sort -R | head -n $lines

Of course <$input_file can be replaced with any piped standard input. This (sort -R and $'...\t...' to get sed to match tab chars) works with GNU/Linux and BSD/macOS.
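
If your sort has no -R option (some older BSD/macOS versions, per the comments below), a rough portable sketch of the same idea is to decorate each line with a random key, sort on the key, and strip it again; $input_file and $lines are the variables defined above:

awk 'BEGIN {srand()} {printf "%.17f\t%s\n", rand(), $0}' "$input_file" |
    sort -k1,1 -n | cut -f2- | head -n "$lines"

Note that awk's srand() seeds from the clock with one-second resolution, so two runs within the same second produce the same order.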

Bruno Bronosky
user881480
  • 49
    `sort` actually sorts identical lines together, so if you may have duplicate lines and you have `shuf` (a gnu tool) installed, it's better to use it for this. – Kevin Feb 12 '12 at 03:59
  • 26
    And also, this is definitely going to make you wait **a lot** if you have a considerably huge file -- 80kk lines -- whereas `shuf -n` acts quite instantaneously. – Rubens Jun 18 '13 at 06:54
  • 28
    sort -R is not available under Mac OS X (10.9) – Mirko Ebert Jun 23 '14 at 15:27
  • @Rubens: why do you think there should be significant time performance difference? `shuf` uses `O(n)`-time while `sort -R` might use `O(n * log n)` algorithm. Even if `n=1000000`; log(n) ~10. – jfs Sep 24 '14 at 18:43
  • 3
    @tfb785: `sort -R` is probably GNU option, install GNU coreutils. btw, `shuf` is also part of coreutils. – jfs Sep 24 '14 at 18:44
  • @J.F.Sebastian For two reasons, actually: first, I ran the code, and it takes **much** longer; second, if your file is reasonably large, which was my case, `sort` dives into external memory, sorting chunks of the file, and later merging them. Complexity hides complexity! :D – Rubens Sep 24 '14 at 21:09
  • @Rubens: `shuf` doesn't work **at all** if the file is so large that it requires external memory. *"I ran the code, and it takes much longer;"* Could you share the code with the corresponding input? – jfs Sep 25 '14 at 00:50
  • 1
    @J.F.Sebastian The code: `sort -R input | head -n `. The input file was 279GB, with 2bi+ lines. Can't share it, though. Anyway, the point is you can keep *some* lines in memory with shuffle to do the random selection of what to output. Sort is going to sort the **entire** file, regardless of what your needs are. – Rubens Sep 25 '14 at 02:01
  • @Rubens in other words, your claim that `sort -R` is slower than `shuf` is *unsubstantiated* because `shuf` tries to load the whole file in memory and it would be difficult for 279GB file. – jfs Sep 25 '14 at 02:06
  • 1
    @J.F.Sebastian True that. `shuf` does indeed load the whole data into memory. Just recalled I actually had to write an `sshuf`, which allows for framed random-picking of lines. This I can [share](http://pastebin.com/Ay1iZBAD). Anyway, `sort -R` itself would take too long to finish, and would still be a needless overhead (in memory or not). – Rubens Sep 25 '14 at 02:19
  • @J.F.Sebastian **Correction**: *`sort -R` itself would take too long to finish, and would still be a needless overhead (in memory or not)* **in case** a framed-shuffle suffices. Other than that, disk would be the bottleneck, and operating in `O(n)` or `O(n log n)` in memory would indeed make no much difference. Framed-shuffles was enough for my needs. – Rubens Sep 25 '14 at 02:42
  • 1
    As everyone said, this approach is only useful with a small file. For files with more than 1M lines, use shuf. – ejoncas Feb 16 '16 at 04:53
  • Doesn't work on mac, since `-R` is not a valid option. – chovy Sep 08 '16 at 08:35
  • FWIW shuf for me is able to give random lines from a file much too large to fit all into RAM, and uses only about 1M RAM itself while doing so. So while I think it might read the entire file, I don't think it reads it into memory all at once. FWIW. In terms of "why is it slower if it's O(n) vs. O(n*logn)" there are other things at play, shuf doesn't actually compare lines so can run faster. – rogerdpack May 15 '17 at 17:14
  • Worth noting that `sort` and `head` read from and output into standard output if no input or output files are specified, so one can pipe input into either command. – Patrick Dark May 30 '19 at 11:42
28

Well, according to a comment on the shuf answer, someone shuffled 78,000,000,000 lines in under a minute.

Challenge accepted...

EDIT: I beat my own record

powershuf did it in 0.047 seconds

$ time ./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null 
./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null  0.02s user 0.01s system 80% cpu 0.047 total

The reason it is so fast: I don't read the whole file, I just move the file pointer 10 times and print the line after each pointer position.

Gitlab Repo
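
A rough shell sketch of that pointer-jumping idea (not powershuf itself; the file name is the one generated below, and GNU stat/shuf are assumed). Note that seeking to a random byte offset favours longer lines and can never return the very first line, which is exactly the bias discussed in the comments:

size=$(stat -c %s lines_78000000000.txt)    # file size in bytes (GNU stat; BSD: stat -f %z)
offset=$(shuf -i 1-"$size" -n 1)            # pick a random byte offset
tail -c +"$offset" lines_78000000000.txt | head -n 2 | tail -n 1   # skip the partial line, print the next one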

Old attempt

First, I needed a file with 78,000,000,000 lines:

seq 1 78 | xargs -n 1 -P 16 -I% seq 1 1000 | xargs -n 1 -P 16 -I% echo "" > lines_78000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000.txt > lines_78000000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000000.txt > lines_78000000000.txt

This gives me a file with 78 billion newlines ;-)

Now for the shuf part:

$ time shuf -n 10 lines_78000000000.txt










shuf -n 10 lines_78000000000.txt  2171.20s user 22.17s system 99% cpu 36:35.80 total

The bottleneck was the CPU, and shuf does not use multiple threads: it pinned one core at 100% while the other 15 went unused.

Python is what I regularly use, so that's what I'll use to make this faster:

#!/usr/bin/env python3
import random

filename = "lines_78000000000.txt"

# First pass: count the lines, reading in 64 KiB chunks
count = 0
with open(filename, "rt") as f:
    while True:
        buffer = f.read(65536)
        if not buffer:
            break
        count += buffer.count('\n')

# Second pass: pick 10 distinct random line numbers and print those lines
wanted = set(random.sample(range(count), 10))
with open(filename, "rt") as f:
    for lineno, line in enumerate(f):
        if lineno in wanted:
            print(line, end='')

This got me just under a minute:

$ time ./shuf.py         










./shuf.py  42.57s user 16.19s system 98% cpu 59.752 total

I did this on a Lenovo X1 Extreme 2nd gen with the i9 and a Samsung NVMe drive, which gives me plenty of read and write speed.

I know it can be made faster, but I'll leave some room for others to give it a try.

Line counter source: Luther Blissett

  • 9
    Well, according to your description of powershuf's inner functioning, it looks like it is just random-ish. Using a file with just two lines, one being 1 character long, the other being 20 characters long, I expect both lines to be chosen with equal chances. This doesn't seem to be the case with your program. – xhienne Jun 26 '20 at 23:26
  • There was an issue with files shorter than 4 KB and some other math mistakes that made it horrible with small files. I fixed them as far as I could find the issues; please give it another try. – Stein van Broekhoven Aug 01 '20 at 20:46
  • 2
    Hi Stein. It doesn't seem to work. Did you test it the way I suggested in my above comment? Before making something quicker than shuf, I reckon you should focus on making something that works as accurately as shuf. I really doubt anyone can beat shuf with a python program. BTW, unless you use the `-r` option, shuf doesn't output the same line twice, and of course this takes additional processing time. – xhienne Aug 01 '20 at 21:47
  • 1
    Why does powershuf discard the first line? Can it ever pick the very first line? It seems to also funnel the search in a weird way: if you have 10 lines too long, then 1 line of valid length, then 5 lines and another line of valid length, then the iteration will find the 10 lines more often than the 5, and funnel about two thirds of the time into the first valid line. The program doesn't promise this, but it would make sense to me if the lines were effectively filtered by length and then random lines were chosen from that set. – Lupilum Aug 18 '21 at 05:48
  • 1
    The question is how to get random lines from a text file in a bash script, not how to write a Python script. – dannyman Feb 09 '22 at 00:40
9

My preferred option is very fast. I sampled a tab-delimited data file with 13 columns, 23.1M rows, 2.0 GB uncompressed.

# randomly sample 5% of the lines in the file
# including header row, exclude blank lines, new seed

time \
awk 'BEGIN  {srand()} 
     !/^$/  { if (rand() <= .05 || FNR==1) print > "data-sample.txt"}' data.txt

# awk  tsv004  3.76s user 1.46s system 91% cpu 5.716 total
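
If an exact count is needed rather than an approximate percentage, a sketch that keeps the header row and samples exactly 1,000 data lines (file names and the count are illustrative, and GNU shuf is assumed):

head -n 1 data.txt > data-sample.txt
tail -n +2 data.txt | shuf -n 1000 >> data-sample.txt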
Merlin
  • 2
    This is brilliant--and super fast. – abalter Mar 25 '21 at 19:09
  • Randomly sample select _approximately_ 5% of lines in file. Law of large numbers will make it close, but since each line is decided independently, there is no way to guarantee it will actually be 5% of lines. – Amadan Oct 06 '22 at 04:38
1
seq 1 100 | python3 -c 'print(__import__("random").choice(__import__("sys").stdin.readlines()))'
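
That prints a single random line. A possible extension to N lines (N=10 here, purely illustrative) is random.sample, which, like the original, slurps all of stdin and therefore only suits inputs that fit in memory:

seq 1 100 | python3 -c 'import random, sys; sys.stdout.write("".join(random.sample(sys.stdin.readlines(), 10)))'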
Andelf
0

Just for completeness's sake and because it's available from Arch's community repos: there's also a tool called shuffle, but it doesn't have any command line switches to limit the number of lines and warns in its man page: "Since shuffle reads the input into memory, it may fail on very large files."

Sixtyfive
-1
# Function to sample N lines randomly from a file
# Parameter $1: Name of the original file
# Parameter $2: N lines to be sampled 
rand_line_sampler() {
    N_t=$(awk 'END {print NR}' "$1")   # number of total lines

    N_t_m_d=$(( N_t - $2 ))            # number of total lines minus desired number of lines

    # vector of 0s (not selected), one per line that will be dropped
    yes 0 | head -n "$N_t_m_d" > vector_0.temp

    # vector of 1s (selected), one per desired line
    yes 1 | head -n "$2" > vector_1.temp

    # shuffle the 0/1 markers, paste them in front of the original lines,
    # keep only the lines marked 1, then strip the marker again
    cat vector_1.temp vector_0.temp | shuf > rand_vector.temp

    paste -d" " rand_vector.temp "$1" |
        awk '$1 == 1' |
        cut -d' ' -f2- > sampled_file.txt   # file with the sampled lines

    rm vector_0.temp vector_1.temp rand_vector.temp
}

rand_line_sampler "parameter_1" "parameter_2"
andrec
-1

In the script below, 'c' is the number of lines to select from the input; modify as needed (a reservoir-sampling variant that always returns exactly c lines is sketched after the comments below):

#!/bin/sh

gawk '
BEGIN          { srand(); c = 5 }         # c = number of lines to keep
c/NR >= rand() { lines[x++ % c] = $0 }    # keep this line with probability min(1, c/NR), filling slots 0..c-1 round-robin
END            { for (i in lines) print lines[i] }
' "$@"
  • 1
    This does not guarantee that exactly `c` lines are selected. At best you can say that the average number of lines being selected is `c`. – user1934428 Jun 14 '22 at 05:44
  • That is incorrect: c/NR will be >= 1 (larger than any possible value of rand() ) for the first c lines, thus filling lines[]. x++ % c forces lines[] to c entries, assuming there are at least c lines in the input – user19322235 Jun 14 '22 at 17:38
  • Right, `c/NR` will be **guaranteed** to be larger than any value produced from `rand` for the **first c lines**. **After** that, it may or may not be larger than `rand`. Therefore we can say that `lines` in the end contains **at least** c entries, and in general more than that, i.e. **not** exactly c entries. Furthermore, the first c lines of the file are always picked, so the whole selection is not what could be called a **random** pick. – user1934428 Jun 15 '22 at 06:00
  • 1
    uh, x++ % c constrains lines[] to indices 0 to c-1. Of course, the first c inputs initially fill lines[], which are replaced in round robin fashion when the random condition is met. A small change (left as an exercise for the reader) could be made to randomly replace entries in lines[], rather than in a round-robin. – user19322235 Jun 15 '22 at 17:11
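
Following up on the discussion above: for comparison, a sketch of classic reservoir sampling (Algorithm R), which always outputs exactly c lines whenever the input has at least that many (gawk assumed, with c mirroring the script above):

gawk -v c=5 '
BEGIN   { srand() }
NR <= c { lines[NR] = $0; next }           # fill the reservoir with the first c lines
        { j = int(rand() * NR) + 1         # afterwards, pick a slot uniformly in 1..NR
          if (j <= c) lines[j] = $0 }      # replace it only if it falls inside the reservoir
END     { for (i = 1; i <= c && (i in lines); i++) print lines[i] }
' "$@"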