Split file by equal parts based on count

Question

I have a file (node_list.txt) that contains a list of nodes.

nod_1
nod_2
nod_3
nod_4
nod_5

I have a list of host IP addresses (the count may vary) and need to divide the node list into as many equal parts as there are hosts, then send each of the resulting node files to one of the hosts, for example host_ip1, host_ip2 and host_ip3.

The nodes in the file are divided based on the number of host IPs available.

Here in my example I should get:

node_list_file_1.txt
nod_1
nod_2

node_list_file_2.txt
nod_3
nod_4

node_list_file_3.txt
nod_5

My code looks like this:

print Dumper(\@list_of_hosts);

my $node_file = "node_list.txt";
open(NODE_FILE, "< $node_file") or die "can't open $node_file: $!";
my $count;
$count += tr/\n/\n/ while sysread(NODE_FILE, $_, 2 ** 16);
print "COUNT:$count\n";

my $res = $count / scalar @list_of_ips;

$res now holds the number of lines that should go into each file. But how do I write those lines to the files?

Open the output files, storing the handles in an array. Step through the input file, writing each line to the appropriate file based on the array. Keep going until finished. Note that you don't need to know how big the input file is (how many lines it contains); you only need to know how many output files you want. — Jonathan Leffler, Dec 26 '19 at 06:30
Also, you should avoid the old-fashioned `NODE_FILE` style of file handles and use lexically scoped file handles: `open my $fh, "<", $node_file or die;` — Jonathan Leffler, Dec 26 '19 at 06:35
To check what need be done when number of files to write doesn't evenly divide number of lines: for 10 lines to break into 3 files do you need lines per file as 4-4-2 or 4-3-3 ? — zdim, Dec 26 '19 at 08:42
@zdim, Or put differently, if there were 5 hosts and 26 nodes, 6-5-5-5-5 or 6-6-6-6-2? (Personally, I don't see how the latter is close to the "equal" the OP requested. But I'll give you the benefit of the doubt for now) — ikegami, Dec 26 '19 at 08:50
@ikegami Yes, an even better example. While they do indeed say "equal number of parts" (what I missed at first), that's a little thin for a spec so I asked for confirmation. — zdim, Dec 26 '19 at 08:53
@zdim : As per the calculation if there are 10 elements put into 3 files then 10/3=3. Eventually 4-3-3 is good to go. — vkk05, Dec 26 '19 at 08:57
vinodk89, I think @zdim is interested in knowing if 4-4-2 is also acceptable. — ikegami, Dec 26 '19 at 09:01
"_4-3-3 is good to go_" -- thank you, I've edited my post to account for that. (I still leave the 4-4-2 code upfront in case that it is perhaps as good for your purpose. If that 4-4-2 split is in fact useless I'll edit again) — zdim, Dec 26 '19 at 09:29

zdim · Answer 1 · 2019-12-29T06:42:29.620

This splits lines so that each file except the last receives the maximum equal number, whereby the last one gets the remainder. So with 10 lines to split over 3 files they'll go as 4-4-2.^†

use warnings;
use strict;
use feature 'say';
use autodie qw(open);

my @lines = <>;
my $num_files = 3;
my $lines_per_file = int @lines/$num_files;
$lines_per_file += 1  if @lines % $num_files;

my @chunks;
push @chunks, [ splice @lines, 0, $lines_per_file ] while @lines;

my @fhs_out = map { open my $fh, '>', "fout_$_.txt"; $fh } 1..$num_files;

for my $i (0..$#chunks) {
    print {$fhs_out[$i]} $_ for @{$chunks[$i]};
}

Notes

The <> reads all lines from files submitted at the command line
If the number of files to write doesn't evenly divide the number of lines to split between them, we need one more line in each file (and the last one receives the remainder)
The array with lines is successively splice-ed, in order to generate chunks of lines that will go into one file each, so it ends up emptied
I open all needed output files and store the filehandles in an array so that I can later conveniently write chunks of lines into their files. This is by no means necessary, as one can iterate over @chunks and open a file and write to it for each group ("chunk") of lines
When writing to a filehandle that needs to be evaluated from an expression any more complex than just a basic scalar, we must put that expression in a block, like { $fhs_out[$i] }. From the documentation for print:

If you're storing handles in an array or hash, or in general whenever you're using any expression more complex than a bareword handle or a plain, unsubscripted scalar variable to retrieve it, you will have to use a block returning the filehandle value instead, [...]

See this post for another way and more discussion.
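
The alternative mentioned in the notes, iterating over @chunks and opening one file per chunk, might look like the following. This is only a minimal sketch: the helper name `write_chunks`, the `fout` prefix and the temporary directory are illustrative assumptions, not part of the code above. Because a file is opened only when there is a chunk to write, no empty files are created.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: write each chunk (an array ref of lines) to its own
# file, opening the file only when there is a chunk to write.
sub write_chunks {
    my ($prefix, @chunks) = @_;
    my @written;
    for my $i (0 .. $#chunks) {
        my $name = sprintf '%s_%d.txt', $prefix, $i + 1;
        open my $fh, '>', $name or die "Can't create $name: $!";
        print {$fh} @{ $chunks[$i] };
        close $fh or die "Can't close $name: $!";
        push @written, $name;
    }
    return @written;
}

# Example: five lines in chunks of two, built with the same splice loop.
my @lines = map { "nod_$_\n" } 1 .. 5;
my @chunks;
push @chunks, [ splice @lines, 0, 2 ] while @lines;

my $dir   = tempdir( CLEANUP => 1 );    # assumption: scratch directory for the demo
my @files = write_chunks( "$dir/fout", @chunks );
print scalar(@files), " files written\n";
```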

^† If the lines must instead be distributed 4-3-3 in this case, that is, split as evenly as possible, the code above needs to be modified like this:

my $lines_per_file = int @lines/$num_files;
my $extra = @lines % $num_files;

my @chunks;
push @chunks,
     [ splice @lines, 0, $lines_per_file + ( $extra-- > 0 ? 1 : 0 ) ] 
         while @lines;

The rest is the same.
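
To check the arithmetic of the two distributions discussed here, the per-file counts for the "as even as possible" split can be computed on their own. This is a small sketch; `chunk_sizes` is a hypothetical helper name introduced only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the number of lines each of $num_files files
# receives when $num_lines lines are split as evenly as possible.
sub chunk_sizes {
    my ($num_lines, $num_files) = @_;
    my $base  = int( $num_lines / $num_files );   # lines every file gets
    my $extra = $num_lines % $num_files;          # files that get one more
    return map { $base + ( $_ <= $extra ? 1 : 0 ) } 1 .. $num_files;
}

print join( '-', chunk_sizes( 10, 3 ) ), "\n";   # 4-3-3
print join( '-', chunk_sizes( 26, 5 ) ), "\n";   # 6-5-5-5-5
print join( '-', chunk_sizes(  2, 3 ) ), "\n";   # 1-1-0
```

The last case shows that with fewer lines than files, some files receive nothing, so the writing code has to decide whether to create them.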

ikegami · Accepted Answer · 2019-12-26T09:36:55.243

my $num_buckets = 3;

my @lines = <>;

my $per_bucket = int( @lines / $num_buckets );
my $num_extras =      @lines % $num_buckets;

for my $bucket_num (1..$num_buckets) {
   my $num_lines = $per_bucket;
   if ($num_extras) {
      ++$num_lines;
      --$num_extras;
   }

   my $qfn = "node_list_file_${bucket_num}.txt";
   open(my $fh, '>', $qfn)
      or die("Can't create \"$qfn\": $!\n");

   $fh->print(splice(@lines, 0, $num_lines));
}

$per_bucket is the number of nodes per file.
$num_extras is the number of files that get one extra node.

Note that the calculation of $num_lines can be condensed to the following (which I avoided for readability):

my $num_lines = $per_bucket + ( $num_extras-- > 0 );

The above loads the entire file into memory. The following is an alternative solution that doesn't:

my $num_buckets = 3;

my @fhs;
for my $bucket_num (1..$num_buckets) {
   my $qfn = "node_list_file_${bucket_num}.txt";
   open(my $fh, '>', $qfn)
      or die("Can't create \"$qfn\": $!\n");

   push @fhs, $fh;
}

$fhs[ ( $. - 1 ) % @fhs ]->print($_) while <>;

However, while it performs the requested task, the output isn't exactly as specified:

node_list_file_1.txt
--------------------
nod_1
nod_4

node_list_file_2.txt
--------------------
nod_2
nod_5

node_list_file_3.txt
--------------------
nod_3

Thank you @ikegami. So, in your first solution I should store all the input file's lines in `@lines`. In the second solution, can you please elaborate on what the statement `$fhs[ ( $. - 1 ) % @fhs ]->print($_) while <>;` means, and how can I supply the input file? – vkk05 Dec 26 '19 at 08:12
`<>` is short for `<ARGV>`, which is short for `readline(ARGV)`, and `ARGV` is a special handle that reads from the files whose paths are in `@ARGV`, or from `STDIN` if `@ARGV` is empty. In short, it acts like virtually every unix program (e.g., `cat`, `grep`, etc). Feel free to use a different handle. – ikegami Dec 26 '19 at 08:15
... and `while <>` is short for `while defined($_ = <>)` – ikegami Dec 26 '19 at 08:31
In both cases, if we set `$num_buckets = 3;` then it will obviously create 3 files. But what if I have only 2 lines of data in `node_list.txt`, say `nod_1` and `node_2`, and it should create only 2 files? As it stands, it also creates a 3rd file with no data in it. How can I avoid that? – vkk05 Dec 30 '19 at 11:17
Using the first approach, simply check if the number of lines to print is zero. /// Using the second, it's a bit more complicated. You'd have to delay creating the file until it's needed. You'd use something like `$fh[$i] //= do { ... };` in the loop. – ikegami Dec 30 '19 at 20:37
Good idea. I now exit the loop when `$num_lines` reaches `0`, i.e., `last if($num_lines == 0);`. – vkk05 Dec 31 '19 at 05:52
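
For the second (streaming) approach, delaying file creation as suggested in the comments could be sketched roughly as below. This is an illustration under assumptions, not ikegami's code: the helper name `split_round_robin`, the explicit input handle, the in-memory input and the temporary directory are all made up for the example, and it keeps its own line counter instead of relying on `$.`. The `//=` operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: distribute lines from the input handle round-robin over
# at most $num_buckets files, creating each file when its first line arrives.
sub split_round_robin {
    my ($in, $num_buckets, $prefix) = @_;
    my (@fhs, @names);
    my $line_num = 0;
    while ( my $line = <$in> ) {
        my $i = $line_num++ % $num_buckets;
        $fhs[$i] //= do {
            my $qfn = sprintf '%s_%d.txt', $prefix, $i + 1;
            open( my $fh, '>', $qfn )
                or die("Can't create \"$qfn\": $!\n");
            $names[$i] = $qfn;
            $fh;    # value of the do block, stored in $fhs[$i]
        };
        print { $fhs[$i] } $line;
    }
    close $_ or die("Can't close output file: $!\n") for @fhs;
    return @names;
}

# Example: two input lines and three buckets produce only two files.
my $dir   = tempdir( CLEANUP => 1 );    # assumption: scratch directory for the demo
my $input = "nod_1\nnod_2\n";
open( my $in, '<', \$input ) or die("Can't open in-memory input: $!\n");
my @created = split_round_robin( $in, 3, "$dir/node_list_file" );
print scalar(@created), " files created\n";
```

To read a real file instead, open `node_list.txt` with a lexical handle and pass that handle as the first argument.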

Polar Bear · Answer 3 · 2019-12-26T10:11:55.417

Perhaps the following code meets your requirements:

use strict;
use warnings;

use feature 'say';

use Data::Dumper;

my $debug = 1;                          # $debug = 1 -- debug mode

my $node_file = "node_list.txt";        # input filename

my @hosts = qw(host_ip1 host_ip2 host_ip3); # Hosts to distribute between

my $num_hosts = @hosts;                 # Number of hosts to distribute between

open(my $fh, "<", $node_file) 
        or die "can't open $node_file: $!";

my @nodes =  <$fh>;                     # read input lines into @nodes array

chomp @nodes;                           # trim newline from each element @nodes array

close $fh;

print Dumper(\@nodes) if $debug;        # print @nodes content in debug mode

my $count = @nodes;                     # count number nodes in @nodes array

print "COUNT: $count lines in the input file\n";

# How many lines to store in each output file
my $lines_in_file = int($count/$num_hosts + 0.5);

my $lines_out   = $lines_in_file;       # how many lines to output per file
my $file_index  = 1;                    # index for output filenames
my $filename    = "node_list_file_${file_index}.txt";

# open OUT file
open(my $out, ">", $filename)
        or die "Couldn't open $filename";

foreach my $node_name (@nodes) {        # process each element of @nodes array
    say $out $node_name;                # store node in OUT file

    $lines_out--;                       # decrease number of left lines for output

    if( $lines_out == 0 ) {             # all lines per file stored
        close $out;                     # close file

        $lines_out = $lines_in_file;    # reinitialize number of lines for output

        $file_index++;                  # increase index for filename
        $filename = "node_list_file_${file_index}.txt";

        open($out, ">", $filename)      # open new OUT file
            or die "Couldn't open $filename";
    }
}

close $out;                             # close OUT file

Thank you. What if I have an odd number of lines in the input node file? E.g., if I have 11 nodes in the input file and need to put them in 6 files (2 in each), the final one should contain 1 node. Is that possible here? – vkk05 Dec 26 '19 at 07:44
Please avoid needlessly using global vars (`NODE_FILE`, `OUT`) – ikegami Dec 26 '19 at 08:06
Please avoid needlessly using 2-arg `open` – ikegami Dec 26 '19 at 08:06
There's no need to `chomp`. – ikegami Dec 26 '19 at 08:07
More importantly, it only works if there are exactly 5 or 6 nodes in the input file. (4 does 2/2/0 instead of 2/1/1, and 7 does 2/2/2 instead of 3/2/2.) – ikegami Dec 26 '19 at 08:12
@ikegami -- It works fine for 11 lines in the input file: it creates 6 files [5 files with 2 lines and the last with only 1 line]. (NODE_FILE, OUT) -- I am old school; I did look into the doc for **open** and indeed that style is not recommended. If **chomp** isn't used then **Dumper(@nodes)** prints an extra **"\n"** in its output, and the output files will have an empty line between each line. Does such output match what the author wants? – Polar Bear Dec 26 '19 at 08:57
It should only create 3 files. The number of file to create is the fixed part (since the number of hosts is fixed), not the number of lines per file. (Sorry, the last comment described the problem incorrectly, but the problem is real.) – ikegami Dec 26 '19 at 08:57
@vinodk89 -- Is it difficult to add a few lines to the input file and see what the outcome is? Do not be afraid to experiment and observe the result -- it is the first rule of science: **observe/experiment and learn**. – Polar Bear Dec 26 '19 at 08:59
Does the original post state that there should be only 3 output files? Does it state what the content of **@list_of_ips** is? I cannot read the poster's mind, so I had to improvise to show working code. – Polar Bear Dec 26 '19 at 09:01
@vinodk89 -- I have adjusted the code to include **hosts to distribute between**. In future, try to describe a problem in a way that is easy to understand. – Polar Bear Dec 26 '19 at 09:40
@ikegami -- I had to use **say {$out} $node_name;** to printout into a file! I would say it is quite **strange** way. For me did not work **say $fh $node_name;** nor **say $fh, $node_name**. And I came across following `To use FILEHANDLE without a LIST to print the contents of $_ to it, you must use a bareword filehandle like FH , not an indirect one like $fh .` -- https://perldoc.perl.org/functions/say.html – Polar Bear Dec 26 '19 at 09:43
@ikegami -- perl -V Summary of my perl5 (revision 5 version 30 subversion 0) – Polar Bear Dec 26 '19 at 09:45
That just means you can't use `say $fh;` to mean `say $fh $_;`, but you can most definitely use `say $fh $_;`. (`perl -e'use feature qw( say ); $_="abc"; my $fh = \*STDOUT; say; say STDOUT; say $fh; say $fh $_;'`) – ikegami Dec 26 '19 at 09:45
(Please use backticks for code, not doubled asterisks.) – ikegami Dec 26 '19 at 09:48
@ikegami -- https://metacpan.org/pod/Perl6::Say -- Section **BUGS AND IRRITATIONS** Perl 5 can not reproduce with **say** feature of **print** (look at the bottom). – Polar Bear Dec 26 '19 at 09:48
Ok, I thought you were talking about the `say` operator. You even linked to the `say` operator in the initial comment. Why are you suddenly talking about some sub in some CPAN module? – ikegami Dec 26 '19 at 09:49
@ikegami -- I only started to use `say` and attempted to output into a file, right away it produced an empty file. Hmm, why is that? Started to read documentation and found that in my case I had to use `say {$fh} $data;` to direct data into a file. Now I each new perl script have to add `use feature 'say'` (annoyance), and how to print data without `"\n"` added? – Polar Bear Dec 26 '19 at 09:54
Re "*Started to read documentation and found that in my case I had to use `say {$fh} $data;` to direct data into a file.*", Again, no. That's not what the passage says (since the usage does provide something for `LIST`). See my earlier comment for more details. – ikegami Dec 26 '19 at 10:01
Re "*Now I each new perl script have to add use feature 'say' (annoyance), and how to print data without "\n" added?*", Or `use v5.10;` or `use MyCustom;` or use `CORE::say` (5.12+). But yes. For backwards compatibility reasons. – ikegami Dec 26 '19 at 10:02
@ikegami -- In my case `use feature 'say';` was not in the code, but `use Data::Dumper;` was. The `say` in the code did not produce any error or warning, and as a result the output files were empty. I have now tested with `use feature 'say';` and the files received their content. How should such a situation be treated, as it created some confusion? – Polar Bear Dec 26 '19 at 10:09
Please provide a minimal, runnable demonstration of the problem. ...Not in the comments, but [here](https://stackoverflow.com/questions/ask) – ikegami Dec 26 '19 at 10:15
@ikegami -- I tested with a minimal code and the error did not manifested itself. Ok, will try do it on original code (code for this question) and if the problem manifest itself I will submit new `say` related question. – Polar Bear Dec 26 '19 at 10:18
@ikegami -- I guess that it might be partially my mistake. I tried different variations and for sure `say $fh, $data;` does not store data in a file. I tried `say $fh $data` with out `use feature 'say';` but with `use Data::Dumper;` and the files was empty. Currently I did the same and files obtain their content -- I am puzzled as I did not change anything else. Hmm? – Polar Bear Dec 26 '19 at 10:24
@ikegami -- It would be nice if the doc page for `say` included an example with indirect filehandle `$fh`. https://perldoc.perl.org/functions/say.html – Polar Bear Dec 26 '19 at 10:25
