
I want to output a random 10% of the lines of a file. For instance, if file a has 1,000,000 lines, then I want to output a random 100,000 lines out of the file (100,000 being 10% of 1,000,000).

There is an easy way to do this, supposing the file is small:

randomLine=`wc -l a | awk '{printf("%d\n",($1/10))}'`
sort -R a | head -n $randomLine

But using sort -R is very slow: it performs a dedicated random computation. My file has 10,000,000 lines, so sorting takes too much time. Is there any way to achieve a less dedicated, not-so-random but efficient sampling?

Edit Ideas:

  1. Sampling one line from every ten lines is acceptable, but I don't know how to do this with a shell script (see the sketch below).
  2. Read line by line, and if

    echo $RANDOM%100 | bc

is greater than 20, then output the line (using a number greater than 10 to ensure I get no less than 10% of the lines), and once 10% of the lines have been output, stop. But I don't know how to read line by line using a shell script. (See the sketch below.)
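
Both ideas can be sketched briefly. These are minimal sketches of my own, assuming bash and the file a from the question:

    # Idea 1: take one line from every bucket of ten (here, deterministically
    # the first of each bucket: lines 1, 11, 21, ...).
    awk 'NR % 10 == 1' a

    # Idea 2: read line by line; IFS= and -r keep \r and backslashes intact.
    # This keeps each line with probability ~10%, not exactly 10%.
    while IFS= read -r line; do
        if (( RANDOM % 100 < 10 )); then
            printf '%s\n' "$line"
        fi
    done < a

Note that a bash while loop is slow on a 10,000,000-line file; the answers below are much faster.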

Edit Description

The reason I want to use a shell script is that my file contains \r characters. The newline character in the file should be \n, but the readline() function in Python and Java regards both \r and \n as newline characters, which doesn't fit my need.
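
A quick way to check that shell tools behave as needed (my addition, assuming bash and od from coreutils): bash's read and awk's default record separator split only on \n, so a stray \r stays inside the line:

    # od -c shows the \r still present in the output.
    printf 'foo\rbar\n' | while IFS= read -r line; do printf '%s\n' "$line"; done | od -c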

Marcus Thornton
  • 1000 is not 10% of 1,000,000 :| – amit Feb 13 '14 at 12:42
  • Would it be random enough to print a random line from every bunch of 10? – fedorqui Feb 13 '14 at 12:43
  • If you're looking for a general idea; have a look at Reservoir Sampling. (no clue, how to adapt that as shell script, though). – qqilihq Feb 13 '14 at 12:45
  • Thanks. 10% of 1,000,000 is 100,000! – Marcus Thornton Feb 13 '14 at 12:45
  • this should not be done in shell. If you insist, read line by line, and each time get an (evenly) distributed random number. Choose a threshold so that 90% of the random numbers are below that threshold (maybe some modulus m). Only print each line if the random number is over the threshold. (If you need *exactly* 10%, make a distribution over [1...number of lines] having 90% of lines below a threshold... you don't want to do that in shell) – Jo So Feb 13 '14 at 12:51
  • You can do this in a single pass with a simple awk script. See my answer. – Jim Mischel Feb 13 '14 at 20:44

5 Answers

4

Let's create a random list of X numbers from 1 to Y. You can do it with:

shuf -i 1-Y -nX

In your case,

shuf -i 1-1000000 -n100000

Then you store it in a variable (space separated) and pass to awk, so that you print those line numbers:

awk 'FNR==NR {a[$1]; next} {if (FNR in a) print}' <(shuf -i 1-1000000 -n100000) file

Explanation

  • FNR==NR {a[$1]; next} loops through the shuf results and stores them in the a[] array.
  • {if (FNR in a) print} while reading the second parameter (the file), prints a line whenever its line number is found in the array a[].

Sample with Y=10, X=2

$ cat a
1 hello
2 i am
3 fe
4 do
5 rqui
6 and
7 this
8 is 
9 sample
10 text

$ awk 'FNR==NR {a[$1]; next} {if (FNR in a) print}' <(shuf -i 1-10 -n2) a
2 i am
9 sample

$ awk 'FNR==NR {a[$1]; next} {if (FNR in a) print}' <(shuf -i 1-10 -n2) a
4 do
6 and

Improvement

As plundra suggested in comments:

shuf -n $(( $(wc -l < $FILENAME) / 10 )) $FILENAME
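
Spelled out with quoting (my addition; FILENAME is a placeholder for the real input file), the same one-liner becomes:

    FILENAME=a   # placeholder for the real input file
    shuf -n "$(( $(wc -l < "$FILENAME") / 10 ))" "$FILENAME" > sample.txt

Here wc -l < "$FILENAME" reads from stdin, so it prints just the count without the file name.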
fedorqui
  • Where does this b come from? I've never seen the "NR in b" syntax before... can you explain what b[a[i]]=a[i]} NR in b means? – Marcus Thornton Feb 13 '14 at 13:27
  • See my updated answer with some explanations. I hope it is clear, don't hesitate to ask for more clarification. I think this way is pretty fast and easy. – fedorqui Feb 13 '14 at 13:32
  • btw, /bin/awk: Argument list too long occurs with my file. – Marcus Thornton Feb 13 '14 at 13:32
  • Haven't thought about that. Let's do it the other way round: giving the `shuf` list as if it were a file. You can find it in my updated answer. – fedorqui Feb 13 '14 at 13:45
  • I am still trying to understand the syntax. What does it mean by <(shuf -i 1-10 -n2) a? I've never seen this syntax and I thought < is to redirect the standard input. But you used shuf here. And how come the output order is sorted? shuf will return the numbers without sorting. – Marcus Thornton Feb 13 '14 at 14:04
  • It is an indirection and behaves the same as if you were giving a file to `awk`. So you can `shuf ... > shuf_file` and then call `awk '...' shuf_file your_file`. For further references you can check http://backreference.org/2010/02/10/idiomatic-awk/ -> "Two-file processing". – fedorqui Feb 13 '14 at 14:08
  • Ok. Let me digest answers. I got several great answers in the thread. – Marcus Thornton Feb 13 '14 at 14:33
  • Why not just `shuf -n $(( $(wc -l < $FILENAME) / 10 )) $FILENAME`? – plundra Feb 13 '14 at 15:01
  • @plundra, I did not know that `shuf` could have a file as parameter like that. It sounds great! If I were you, I would post as an answer, it is the best way to do it :) Now I blame myself for such a long `awk` version while just `shuf` sufficed! – fedorqui Feb 13 '14 at 15:04
1

I think this is the best way:

file=your file here
lines_in_file=`wc -l < $file`
lines_wanted=$(($lines_in_file/10))

shuf -n $lines_wanted $file

Another creative solution:

echo $RANDOM generates a random number between 0 and 32767

Then, you can do:

echo $(($RANDOM*100000/32767+1))

... to obtain a random number between 1 and 100000 (as nwellnhof points out in the comments below, it's not any number from 1 to 100000, but one of 32768 possible numbers between 1 and 100000, so it's kind of a projection...)
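
One workaround (my addition, not part of the original answer): combine two $RANDOM draws into 30 bits, so every value in 1..100000 becomes reachable, at the cost of a small modulo bias:

    # Two 15-bit draws give 0..2^30-1; reduce that to 1..100000.
    echo $(( ((RANDOM << 15) | RANDOM) % 100000 + 1 ))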

So:

file=your file here
lines_in_file=`wc -l $file | awk '{print $1}'`
lines_wanted=$(($lines_in_file/10))
for i in `seq 1 $lines_wanted`; do
    line_chosen=$(($RANDOM*${lines_in_file}/32767+1))
    sed "${line_chosen}q;d" $file
done
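
On the comment below about asking sed to extract all the chosen lines at once: a sketch of my own (assuming GNU sed, which reads a script from standard input with -f -) turns each chosen line number N into a sed print command Np:

    # Build a script like "123p\n456p\n..." from shuf and run it in one pass.
    shuf -i 1-"$lines_in_file" -n "$lines_wanted" | sed 's/$/p/' | sed -n -f - "$file"

The lines come out in file order rather than random order, and there are no duplicates because shuf -i samples without replacement.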
Alex Jurado - Bitendian
  • That is an elegant and straightforward solution, but are you sure that doing 100k passes over the file with `sed` is faster than having `sort` read in the complete file once and shuffling it? – Damon Feb 13 '14 at 13:19
  • I'm sure because I've just tried it and it is fast indeed. I did the same with sort and it took so long that I just Ctrl+C'd it. Ideally one would ask sed to extract all the chosen lines at once, but I don't know how. Also, I like my solution because you see lines from the first second, and you can interrupt it whenever you want. – Alex Jurado - Bitendian Feb 13 '14 at 13:22
  • `$RANDOM*100000/32767+1` can produce only 32768 numbers between 1 and 100000, not the whole range. – nwellnhof Feb 13 '14 at 13:27
  • You are right... I can't squeeze $RANDOM anymore! < : ) Take into account that Marcus asked: (...)"less dedicated and not so random but efficient sampling"(...) – Alex Jurado - Bitendian Feb 13 '14 at 13:30
  • Just found out that shuf does it for me! Edited my answer accordingly, though I've not deleted the previous approach because I think it is useful as a practice exercise. – Alex Jurado - Bitendian Feb 13 '14 at 13:41
  • shuf -n $lines_wanted $file indeed performed very fast. But how come? What kind of algorithm can output an exact number of random lines this fast? – Marcus Thornton Feb 13 '14 at 14:21
  • I'd bet it calculates the set of random line numbers and then asks for the corresponding lines in a single shot. – Alex Jurado - Bitendian Feb 13 '14 at 16:10
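
qqilihq's comment on the question mentioned reservoir sampling, which is one way a tool can take a uniform sample in a single pass with bounded memory. A minimal awk sketch of Algorithm R (my illustration, not necessarily what shuf actually does):

    # Keep a pool of k lines; each later line replaces a random pool slot
    # with probability k/NR, yielding a uniform k-line sample of the input.
    awk -v k=100000 'BEGIN { srand() }
        NR <= k { pool[NR] = $0; next }
        { i = int(rand() * NR) + 1; if (i <= k) pool[i] = $0 }
        END { for (j = 1; j <= k; j++) print pool[j] }' a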
0

I have this script that will give you roughly 1/x of the lines.

#!/usr/bin/perl -w

use strict;

my $ratio = shift;

while (<>) {
    print if ((rand) <= 1 / $ratio);
}

For a large enough number of lines, assuming a uniform distribution of rand's outputs.

Assuming you call this random_select_ratio.pl, run it like this to get 10% of the lines:

random_select_ratio.pl 10 my_file

or

cat my_file | random_select_ratio.pl 10
Nathan Fellman
  • Thanks. This is what I wanted to do in Python. But Python's readline will regard \r as a newline character. I'm not familiar with Perl, so let me check whether it will work when a \r character is contained in a line. – Marcus Thornton Feb 13 '14 at 14:10
0

Just run this awk script with the file as input.

BEGIN { srand() }{ if (rand() < 0.10) print $0; }

It's been a while since I used awk, but I do believe that should do it.

And in fact it does work exactly as expected. Approximately 10% of the lines are output. On my Windows machine using GNU awk, I ran:

awk "BEGIN { srand() }{ if (rand() < 0.10) print $0; }" <numbers.txt >nums.txt

numbers.txt contained the numbers 1 through 1,000,000, one per line. Over multiple runs, the file nums.txt typically contained about 100,200 items, which works out to 10.02%.

If there's a problem with what awk considers a line, you can always change the record separator. That is, RS = "\n"; but that should already be the default on a Linux machine.
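
For completeness, the same one-liner with the record separator set explicitly (a minor variation of the answer above, using what is already the default):

    awk 'BEGIN { srand(); RS = "\n" } rand() < 0.10' numbers.txt > nums.txt

Because awk splits records only on \n here, any \r characters stay inside the lines, which matches the requirement in the question's edit.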

Jim Mischel
0

Here's one way to do Edit idea 1 in bash:

while readarray -n10 a; do
    [ ${#a[@]} = 0 ] && break
    printf "%s" "${a[${RANDOM: -1:1}]}"
done < largefile.txt

Kinda slow, though it was about 2.5x faster than the sort -R method on my machine.

We use readarray to read from the input stream 10 lines at a time into an array. Then we use the last digit of $RANDOM as an index into that array and print the resulting line.

Using the readarray/printf combo should ensure the \r characters are passed through unmodified, as in the edited requirement.
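
An awk variant of the same bucket-of-10 idea (my sketch, assuming a POSIX awk; it avoids the bash loop overhead): pick a random offset at the start of each group of ten lines and print the line at that offset:

    awk 'BEGIN { srand() }
        (NR - 1) % 10 == 0 { pick = int(rand() * 10) }
        (NR - 1) % 10 == pick' largefile.txt

As with readarray, awk's default \n record separator leaves \r characters untouched. (If the last group has fewer than ten lines, it may contribute no line.)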

Digital Trauma