How to randomly sample lines, output them in a new file, and delete them from the original file?

Question

I would like to select N lines at random from a big file called input_file (6250000 lines, N=1250000), write those N lines to a new file called output_file, and delete them from the original file input_file.

According to this post, the first two tasks can be achieved with:

sort -R input_file | head -n $N > output_file

How can I delete the selected N lines from input_file?

dada · Answer 1 · 2017-08-11T09:21:24.987

score 0

This does the job if we don't care about the ordering of the lines after deleting the N lines:

# shuffle the input file
sort -R input_file > shuffled
# select N lines
head -n $N shuffled > output_file
# delete the selected N lines
sed -i -e "1,${N}d" shuffled

shuffled now contains the lines of input_file that are not in output_file, in shuffled order. To replace the original, move it over input_file (mv shuffled input_file).
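As a hedged sketch of the same shuffle-then-split idea on a small generated file: this version assumes GNU coreutils and uses shuf rather than sort -R, because sort -R sorts by a hash of each line and so leaves identical lines next to each other. The demo_* file names are made up for illustration.

```shell
# Small demo input; demo_input, demo_shuffled, demo_sample and demo_rest are hypothetical names.
N=3
seq 1 10 > demo_input

# Shuffle once, then split the shuffled file:
# the first N lines are the sample, the rest is what remains of the input.
shuf demo_input > demo_shuffled
head -n "$N" demo_shuffled > demo_sample
tail -n +"$((N + 1))" demo_shuffled > demo_rest

wc -l < demo_sample   # 3 lines in the sample
wc -l < demo_rest     # 7 lines left over
```

Every input line ends up in exactly one of the two files, so nothing is lost or duplicated by the split.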

edited Aug 11 '17 at 09:21

answered Aug 11 '17 at 09:10

score 0 · Answer 2 · answered Aug 11 '17 at 09:20

Selecting N random lines is easy:

# select N random lines
sort -R input_file | head -n $N > output_file

If all lines in the input are unique, then you can remove the selected lines and preserve the order of the retained lines with:

# -F treats each selected line as a literal string rather than a regular expression
grep -v -x -F -f output_file input_file > input_file.bak && mv input_file.bak input_file
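A small sketch of why fixed-string matching (grep's -F flag) matters for this step: without it, each sampled line is read as a regular expression and can delete lines that were never selected. The demo_* files below are hypothetical.

```shell
# Hypothetical demo files: lines that contain a regex metacharacter (the dot).
printf '%s\n' 'a.c' 'abc' 'a.c.d' > demo_lines
printf '%s\n' 'a.c' > demo_picked

# Without -F, the pattern a.c is a regex, so the unsampled line "abc" is deleted too.
grep -v -x -f demo_picked demo_lines > demo_regex_result

# With -F, patterns are literal strings; -x still requires a whole-line match.
grep -v -x -F -f demo_picked demo_lines > demo_fixed_result

cat demo_regex_result   # a.c.d
cat demo_fixed_result   # abc, then a.c.d
```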

If the lines in the input are not unique, you can preserve the order of the retained lines with a bit more work:

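# prefix every line with its line number so each line is unique, then shuffle
awk '{ print NR "\t" $0 }' input_file | sort -R > shuffled
# the first N shuffled lines are the sample; strip the line-number prefix
head -n "$N" shuffled | cut -f2- > output_file
# restore the original order of the remaining lines, then strip the prefix
tail -n +"$((N+1))" shuffled | sort -n | cut -f2- > input_file.bak && mv input_file.bak input_file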
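Another way to handle duplicate lines, sketched here with awk: count how many copies of each line were sampled, then skip exactly that many copies while copying the input in order. This assumes the sample file is non-empty (with an empty first file the NR == FNR test would misfire), and it drops the first copies it meets rather than the exact positions that were drawn, which makes no difference to the content since the copies are identical. The dup_* names are hypothetical.

```shell
# Hypothetical demo: the line "x" appears three times and one copy was sampled.
printf '%s\n' x a x b x > dup_input
printf '%s\n' x b > dup_sample

# First file (dup_sample): count how many copies of each line were sampled.
# Second file (dup_input): skip that many copies, print everything else in order.
awk 'NR == FNR { drop[$0]++; next }
     drop[$0] > 0 { drop[$0]--; next }
     { print }' dup_sample dup_input > dup_rest

cat dup_rest   # a, x, x (one per line)
```

Only the sampled lines are held in memory, so the approach does not depend on how the sample was produced.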
