I want to output a random 10% of the lines of a file. For instance, if file a has 1,000,000 lines, then I want to output a random 100,000 of them (100,000 being 10% of 1,000,000).
There is an easy way to do this, assuming the file is small:
randomLine=`wc -l a | awk '{printf("%d\n",($1/10))}'`
sort -R a | head -n $randomLine
But sort -R is very slow: it does a full random shuffle of the whole file. My file has 10,000,000 lines, and sorting it takes too much time. Is there any way to achieve a cheaper, not perfectly random, but efficient sampling?
Edit Ideas:
- Sampling one line out of every ten would be acceptable, but I don't know how to do this with a shell script (see the first sketch after this list).
- Read the file line by line, and if
      echo $RANDOM%100 | bc
  is greater than 20, then output the line (using 20 rather than 10 to make sure I get no less than 10% of the lines), and stop once 10% of the lines have been output. But I don't know how to read a file line by line with a shell script (see the second sketch after this list).
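For the first idea, a minimal sketch using awk (an assumption on my part that awk is acceptable here; the file name a is from the example above). NR is awk's line counter, so this keeps exactly every tenth line and reads the file only once:

# keep lines 10, 20, 30, ... i.e. exactly 10% of the file
awk 'NR % 10 == 0' a

Because awk by default splits records only on \n, any \r characters inside a line are passed through untouched.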
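For the second idea, a sketch of the line-by-line loop in bash (assuming bash, since $RANDOM is a bash/ksh feature; the threshold of 20 and the 10% cut-off are taken directly from the idea above):

# stop after 10% of the total number of lines have been printed
total=`wc -l < a`
limit=`expr $total / 10`
count=0
while IFS= read -r line
do
    # roughly 80% chance per line, as in the idea above
    if [ `echo "$RANDOM % 100" | bc` -gt 20 ]; then
        printf '%s\n' "$line"
        count=`expr $count + 1`
        [ $count -ge $limit ] && break
    fi
done < a

read -r with an empty IFS splits only on \n and leaves \r characters inside the line as ordinary data. Note that spawning bc (and expr) once per line will itself be slow on a 10,000,000-line file; in bash, $((RANDOM % 100)) and $((count + 1)) would avoid those extra processes.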
Edit Description:
The reason I want to use a shell script is that my file contains \r characters. The newline character in the file should be \n, but the readline() function in Python and Java treats both \r and \n as newline characters, which doesn't fit my needs.
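As a quick check of that constraint (sample.txt is just a hypothetical test file): bash's read splits only on \n, so a \r embedded in a line stays in the data:

# one \n-terminated line that contains an embedded \r
printf 'part1\rpart2\n' > sample.txt
while IFS= read -r line
do
    printf '%s' "$line" | od -c   # shows the \r still inside the line
done < sample.txt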