I would like to select N lines at random from a big file called input_file (6,250,000 lines, N = 1,250,000), write those N lines to a new file called output_file, and delete them from the original input_file.

According to this post, the first two tasks can be achieved with:

sort -R input_file | head -n $N > output_file

How can I delete the selected N lines from input_file?

dada

2 Answers

This does the job if we don't care about the ordering of the lines after deleting the N lines:

# shuffle the input file
sort -R input_file > shuffled
# select N lines
head -n $N shuffled > output_file
# delete the selected N lines
sed -i -e "1,${N}d" shuffled

shuffled now contains the lines of input_file that are not in output_file.
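
The same shuffle-and-split idea can be wrapped in a small function. This is only a sketch: the function name split_random is hypothetical, and it assumes GNU shuf is available (it uses shuf in place of sort -R, and tail in place of sed -i, so the shuffled data is written once and split once).

```shell
# split_random FILE N OUT  (hypothetical helper, not part of the answer above)
# Moves N randomly chosen lines of FILE into OUT.
# FILE keeps the remaining lines, in shuffled order.
# Assumption: GNU coreutils shuf is installed.
split_random() {
    file=$1 n=$2 out=$3
    tmp=$(mktemp) || return 1
    # Shuffle the whole file once into a temporary file.
    shuf "$file" > "$tmp" || return 1
    # The first n shuffled lines become the sample.
    head -n "$n" "$tmp" > "$out"
    # Everything from line n+1 onward replaces the original file.
    tail -n +"$((n + 1))" "$tmp" > "$file"
    rm -f "$tmp"
}
```

With the numbers from the question it would be called as split_random input_file 1250000 output_file.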

dada

Selecting N random lines is easy:

# select N random lines
sort -R input_file | head -n $N > output_file

If all lines in the input are unique, then you can remove the selected lines and preserve the order of the retained lines with:

grep -v -x -F -f output_file input_file > input_file.bak && mv input_file.bak input_file
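
With more than a million selected lines, loading them all as grep patterns may be slow or memory-hungry. An alternative sketch, again assuming all lines are unique, uses an awk lookup table instead. The function name remove_selected is hypothetical.

```shell
# remove_selected INPUT SELECTED  (hypothetical helper)
# Deletes from INPUT every line that also appears in SELECTED,
# keeping the remaining lines in their original order.
# Assumption: INPUT and SELECTED are different file names.
remove_selected() {
    # While reading SELECTED (the first file argument), remember each line;
    # while reading INPUT, print only lines that were not remembered.
    awk 'FILENAME == ARGV[1] { drop[$0]; next } !($0 in drop)' "$2" "$1" > "$1.tmp" &&
        mv "$1.tmp" "$1"
}
```

For the files in the question the call would be remove_selected input_file output_file. Like the grep command, this removes every copy of a selected line, so it is only correct when lines are unique.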

If the lines in the input are not unique, you can preserve the order of the retained lines with a bit more work:

sort -R input_file > shuffled
head -n $N shuffled > output_file
tail -n +$((N+1)) shuffled > keep
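grep -x -F -f keep input_file > input_file.bak && mv input_file.bak input_file

Because sort -R keeps identical lines together, at most one group of duplicates can straddle the cut between output_file and keep; if one does, grep retains every copy of that line in input_file.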
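
One way to handle duplicate lines exactly is to tag each line with its line number before shuffling, so that every line is unique during the split and the original order can be restored afterwards. The following is a sketch of that idea, not the method from the answer above. The function name split_keep_order is hypothetical, and GNU shuf is assumed.

```shell
# split_keep_order FILE N OUT  (hypothetical helper)
# Moves N randomly chosen lines of FILE into OUT and leaves the other
# lines in FILE in their original order, even if FILE has duplicate lines.
# Assumption: GNU coreutils shuf is installed.
split_keep_order() {
    file=$1 n=$2 out=$3
    tmp=$(mktemp) || return 1
    # Prefix every line with "<line number><TAB>", then shuffle.
    awk '{ print NR "\t" $0 }' "$file" | shuf > "$tmp" || return 1
    # The first n shuffled lines are the sample; strip the number prefix.
    head -n "$n" "$tmp" | cut -f2- > "$out"
    # Restore the remaining lines to their original order by sorting
    # on the line number, strip the prefix, and overwrite the input.
    tail -n +"$((n + 1))" "$tmp" | sort -n | cut -f2- > "$file"
    rm -f "$tmp"
}
```

For the question's files the call would be split_keep_order input_file 1250000 output_file. Because only line numbers are compared, no pattern matching against line contents is needed.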
janos