0

I have a large text file. I want to select n lines randomly, remove it from the original file and put it into the new file.

Solutions are given here but they dont remove the lines from the original file.

thanks

Community
  • 1
  • 1

1 Answers1

2

Create a file with 1 million lines:

perl -e 'for (1..1000000) { print "line $_ - and some data_$_\n" }' > large_file

Here is a perl script to sample the large file:

sample_size.pl

#!/usr/bin/env perl

use warnings;
use strict;

my ($filename, $n) = @ARGV;
$filename
    or die "usage: $0 filename sample_size";

-f $filename
    or die "Invalid filename '$filename'";
chomp(my ($word_count_lines) = `/usr/bin/wc -l $filename`);
my ($lines, undef) = split /\s+/, $word_count_lines;

die "Need to pass in sample size"
    unless $n;
my $sample_size = int $n;

die "Invalid sample size '$n', should in the between [ 0 - $lines ]"
    unless (0 < $sample_size and $sample_size < $lines);

# Pick some random line numbers
my %sample;
while ( keys %sample < $sample_size ) {
    $sample{ 1+int rand $lines }++;
}

open my $fh, $filename
    or die "Unable to open '$filename' for reading : $!";

open my $fh_sample, "> $filename.sample"
    or die "Unable to open '$filename.sample' for writing : $!";
open my $fh_remainder, "> $filename.remainder"
    or die "Unable to open '$filename.remainder' for writing : $!";

my $current_fh;
while (<$fh>) {
    my $line_number = $.;
    $current_fh = $sample{ $line_number } ? $fh_sample : $fh_remainder;
    # Write to correct file
    print $current_fh $_;
}
close $fh
    or die "Unable to finish reading '$filename' : $!";
close $fh_sample
    or die "Unable to finish writing '$filename.sample' : $!";
close $fh_remainder
    or die "Unable to finish writing '$filename.sample' : $!";

print "Original file '$filename' has $lines rows\n";
print "Created '$filename.sample' with $sample_size rows\n";
print "Created '$filename.remainder' with " . ($lines - $sample_size) . " rows\n";
print "Run 'mv $filename.remainder $filename' if you are happy with this result\n";

Run the script

$ perl ./sample_size.pl large_file 10

Output

Original file 'large_file' has 1000000 rows
Created 'large_file.sample' with 10 rows
Created 'large_file.remainder' with 999990 rows
Run 'mv large_file.remainder large_file' if you are happy with this result
xxfelixxx
  • 6,512
  • 3
  • 31
  • 38