How can I read multiple tab-separated files without using a lot of while loops

Question

I have a bunch of input files, around 200MB, which I need to read in Perl, extract specific information, and write it into a new file for each of those files. How can I do it without using a lot of while loops?

Each input file is tab-separated like this. The fields are ACME A, 0, 2

In every file I want to obtain the third column

ACME A  0   2
ACME A  1   0
ACME A  2   0
ACME A  3   0
ACME A  4   0
ACME A  5   0
ACME A  6   0

Here is my code so far:

#! /usr/bin/perl -w

# compiler profiles

use strict;
use warnings;

sub trim($) {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;    #/ turn off wrong syntax highlight
    return $string;
}

# file locations

my $input_file   = "C:/Perl64/output/sbc_cause_comp_intraday_2017-02-03_00_01.txt";
my $input_file1  = "C:/Perl64/output/sbc_cause_comp_intraday_2017-02-03_00_06.txt";
my $input_file2  = "C:/Perl64/output/sbc_cause_comp_intraday_2017-02-03_00_11.txt";
my $input_file3  = "C:/Perl64/output/sbc_cause_comp_intraday_2017-02-03_00_16.txt";
my $input_file4  = "C:/Perl64/output/sbc_cause_comp_intraday_2017-02-03_00_21.txt";
my $input_file5  = "C:/Perl64/output/sbc_cause_comp_intraday_2017-02-03_00_26.txt";
my $input_file6  = "C:/Perl64/output/sbc_cause_comp_intraday_2017-02-03_00_31.txt";
my $input_file7  = "C:/Perl64/output/sbc_cause_comp_intraday_2017-02-03_00_36.txt";
my $input_file8  = "C:/Perl64/output/sbc_cause_comp_intraday_2017-02-03_00_41.txt";
my $input_file9  = "C:/Perl64/output/sbc_cause_comp_intraday_2017-02-03_00_46.txt";
my $input_file10 = "C:/Perl64/output/sbc_cause_comp_intraday_2017-02-03_00_51.txt";
my $input_file11 = "C:/Perl64/output/sbc_cause_comp_intraday_2017-02-03_00_56.txt";

my $output_file = "C:/Perl64/output/denemecik.txt";

# commands ######

my $ne;
my @cc_type;
my @cc_count;
my @cc_count1;
my @cc_count2;
my @cc_count3;
my @cc_count4;
my @cc_count5;
my @cc_count6;
my @cc_count7;
my @cc_count8;
my @cc_count9;
my @cc_count10;
my @cc_count11;

my @total;
my $i;

my @count   = 0;
my @count1  = 0;
my @count2  = 0;
my @count3  = 0;
my @count4  = 0;
my @count5  = 0;
my @count6  = 0;
my @count7  = 0;
my @count8  = 0;
my @count9  = 0;
my @count10 = 0;
my @count11 = 0;

my $date  = 'sbc_cause_comp_intraday_2017-02-03_00_01';
my $date1 = substr( $date, 24, 10 );
my $hour  = substr( $date, 35, 1 );

#print ($hour);

open INPUT,   "< $input_file"   or die "$0: open of $input_file failed, error: $! \n";
open INPUT1,  "< $input_file1"  or die "$0: open of $input_file1 failed, error: $! \n";
open INPUT2,  "< $input_file2"  or die "$0: open of $input_file2 failed, error: $! \n";
open INPUT3,  "< $input_file3"  or die "$0: open of $input_file3 failed, error: $! \n";
open INPUT4,  "< $input_file4"  or die "$0: open of $input_file4 failed, error: $! \n";
open INPUT5,  "< $input_file5"  or die "$0: open of $input_file5 failed, error: $! \n";
open INPUT6,  "< $input_file6"  or die "$0: open of $input_file6 failed, error: $! \n";
open INPUT7,  "< $input_file7"  or die "$0: open of $input_file7 failed, error: $! \n";
open INPUT8,  "< $input_file8"  or die "$0: open of $input_file8 failed, error: $! \n";
open INPUT9,  "< $input_file9"  or die "$0: open of $input_file9 failed, error: $! \n";
open INPUT10, "< $input_file10" or die "$0: open of $input_file10 failed, error: $! \n";
open INPUT11, "< $input_file11" or die "$0: open of $input_file11 failed, error: $! \n";

open OUTPUT, "> $output_file" or die "$0: open of $output_file failed, error: $! \n";

print OUTPUT ( "**********************************************************************\n" );

while ( defined( $_ = <INPUT> ) ) {

    my $line = $_;
    my ( $ne, $cc_type, $cc_count ) = split( '\t', $line );
    my $count = trim( $cc_count );

    print( "$ne\n" );

    while ( defined( $_ = <INPUT1> ) ) {
        my $line1 = $_;
        my ( undef, undef, $cc_count1 ) = split( '\t', $line1 );
        my $count1 = trim( $cc_count1 );

        #print("$count1\n");
        while ( defined( $_ = <INPUT2> ) ) {
            my $line2 = $_;
            my ( undef, undef, $cc_count2 ) = split( '\t', $line2 );
            my $count2 = trim( $cc_count2 );

            #print("$cc_count2\n");
            while ( defined( $_ = <INPUT3> ) ) {
                my $line3 = $_;
                my ( undef, undef, $cc_count3 ) = split( '\t', $line3 );
                my $count3 = trim( $cc_count3 );

                #print("$cc_count3\n");
                while ( defined( $_ = <INPUT4> ) ) {
                    my $line4 = $_;
                    my ( undef, undef, $cc_count4 ) = split( '\t', $line4 );
                    my $count4 = trim( $cc_count4 );

                    # print("$cc_count4\n");
                    while ( defined( $_ = <INPUT5> ) ) {
                        my $line5 = $_;
                        my ( undef, undef, $cc_count5 ) = split( '\t', $line5 );
                        my $count5 = trim( $cc_count5 );

                        #print("$cc_count5\n");
                        while ( defined( $_ = <INPUT6> ) ) {
                            my $line6 = $_;
                            my ( undef, undef, $cc_count6 ) = split( '\t', $line6 );
                            my $count6 = trim( $cc_count6 );

                            #print("$cc_count6\n");
                            while ( defined( $_ = <INPUT7> ) ) {
                                my $line7 = $_;
                                my ( undef, undef, $cc_count7 ) = split( '\t', $line7 );
                                my $count7 = trim( $cc_count7 );

                                #print("$cc_count7\n");
                                while ( defined( $_ = <INPUT8> ) ) {
                                    my $line8 = $_;
                                    my ( undef, undef, $cc_count8 ) = split( '\t', $line8 );
                                    my $count8 = trim( $cc_count8 );

                                    #print("$cc_count8\n");
                                    while ( defined( $_ = <INPUT9> ) ) {
                                        my $line9 = $_;
                                        my ( undef, undef, $cc_count9 ) = split( '\t', $line9 );
                                        my $count9 = trim( $cc_count9 );

                                        #print("$cc_count9\n");
                                        while ( defined( $_ = <INPUT10> ) ) {
                                            my $line10 = $_;
                                            my ( undef, undef, $cc_count10 ) = split( '\t', $line10 );
                                            my $count10 = trim( $cc_count10 );

                                            #print("$cc_count10\n");
                                            while ( defined( $_ = <INPUT11> ) ) {
                                                my $line11 = $_;
                                                my ( undef, undef, $cc_count11 ) = split( '\t', $line11 );
                                                my $count11 = trim( $cc_count11 );

                                                #print("$cc_count11\n");

                                                for ( $i = 0; $i < scalar @count; $i++ ) {

                                                    $total[$i] = $count[$i]
                                                            + $count1[$i]
                                                            + $count2[$i]
                                                            + $count3[$i]
                                                            + $count4[$i]
                                                            + $count5[$i]
                                                            + $count6[$i]
                                                            + $count7[$i]
                                                            + $count8[$i]
                                                            + $count9[$i]
                                                            + $count10[$i]
                                                            + $count11[$i];

                                                    #   print("@total\n");
                                                }

                                                print OUTPUT (
                                                    "$date1 $hour $ne $cc_type  $count  $count1 $count2 $count3 $count4 $count5 $count6 $count7 $count8 $count9 $count10    $count11 $total\n"
                                                );

                                                #   print("@total\n");
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

close OUTPUT;
close INPUT;
close INPUT1;
close INPUT2;
close INPUT3;
close INPUT4;
close INPUT5;
close INPUT6;
close INPUT7;
close INPUT8;
close INPUT9;
close INPUT10;
close INPUT11;

Can you please help me write this code more logically? Right now I am obtaining the values in an array, but only the first row of the output file is correct; the remaining rows just repeat the first one, so they are wrong.

The second problem is that I don't know how to add up the numeric values in one row properly. I want to add those values for each row and write the sum as the final column in the output file.

Here is an example of the output file:

 DATE    HOUR    NE    CC TYPE    FILE 00_01    FILE 00_06    FILE 00_11    FILE 00_16    FILE 00_21    FILE 00_26    FILE 00_31    FILE 00_36    FILE 00_41    FILE 00_46    FILE 00_51    FILE 00_56    TOTAL
 2/3/2017 00    ACME A  0   2   4   43  4   4   25  4   3   26  4   4   18  141
 2/3/2017 00    ACME A  1   0   0   1   8   0   0   0   0   4   0   0   0   13
 2/3/2017 00    ACME A  2   0   0   0   0   0   0   0   0   0   0   0   0   0
 2/3/2017 00    ACME A  3   0   0   3   1   0   6   5   0   6   1   4   1   27
 2/3/2017 00    ACME A  4   0   0   0   0   0   0   0   0   0   0   0   0   0
 2/3/2017 00    ACME A  5   0   0   0   0   0   0   0   0   0   0   0   0   0

You know you are allowed to finish one file before starting to read another, right? — infixed, May 05 '17 at 19:21
Have you considered scanning the directory and processing each file in a uniform manner? Check out [`readdir`](http://perldoc.perl.org/functions/readdir.html) or the [Path::Tiny](https://metacpan.org/pod/Path::Tiny) module, for example. — Matt Jacob, May 05 '17 at 19:40
Just as a suggestion, write a subroutine that takes a file name as an argument, opens that file, reads each line, gets your column of interest and sums that into a total. Then close the file and return that total as the result of the subroutine. Then you can call that subroutine for each of your file names in your main section. — infixed, May 05 '17 at 19:49
Don't use subroutine prototypes: `sub trim($) { ... }` should be `sub trim { ... }`, although I don't think you need to trim your data at all. — Borodin, May 05 '17 at 20:23
[When you find yourself adding an integer suffix to variable names, think "*I should have used an array*".](https://stackoverflow.com/a/1829927/100754) — Sinan Ünür, May 05 '17 at 21:46
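
The suggestions in the comments above could be combined roughly as follows. This is only a sketch: the subroutine name `sum_third_column` is hypothetical, and the two temporary files are made-up demo data that stand in for the real inputs so the example runs on its own.

```perl
use strict;
use warnings;
use autodie;
use File::Temp qw/ tempfile /;

# Hypothetical helper: open one file, add up its third tab-separated
# column, and return the total.
sub sum_third_column {
    my ($file) = @_;
    my $total = 0;
    open my $fh, '<', $file;
    while ( my $line = <$fh> ) {
        chomp $line;
        my @fields = split /\t/, $line;
        $total += $fields[2];
    }
    close $fh;
    return $total;
}

# Demo data only: two small temporary files standing in for the real inputs.
my @files;
for my $rows ( [ 2, 0, 0 ], [ 4, 1, 3 ] ) {
    my ( $tmp_fh, $name ) = tempfile( UNLINK => 1 );
    print $tmp_fh "ACME A\t$_\t$rows->[$_]\n" for 0 .. $#$rows;
    close $tmp_fh;
    push @files, $name;
}

# One array of totals instead of $count, $count1, $count2, ...
my @totals = map { sum_third_column($_) } @files;
print "@totals\n";
```

Each file is opened, read to the end, and closed before the next one is touched, so no loops need to be nested.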

Borodin · Accepted Answer · 2017-05-06T10:02:26.100

Update

Now that I understand your requirement better I can write a more appropriate solution

To test my code I've used twelve copies of this input file

ACME A  0   2
ACME A  1   3
ACME A  2   5
ACME A  3   7
ACME A  4   11
ACME A  5   13
ACME A  6   17

which is the same as yours except that I've added some variation to the last column to make it clearer whether the code is working

Note that I have added `use autodie`, which removes the need to explicitly check the status of file operations like `open`
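
As a rough illustration of that difference, the sketch below tries to open a file that does not exist, first with a manual check and then with `autodie` in scope. The path is a deliberately nonexistent, hypothetical name.

```perl
use strict;
use warnings;

my $missing = 'no_such_directory/no_such_file.txt';    # hypothetical path

# Without autodie, a failed open just returns false and must be checked.
my $manual_ok = open( my $fh1, '<', $missing ) ? 1 : 0;

# With autodie in scope, the same failure throws an exception instead.
my $died = 0;
{
    use autodie;
    eval { open my $fh2, '<', $missing; 1 } or $died = 1;
}

print "manual check: $manual_ok, autodie threw: $died\n";
```

`autodie` is lexically scoped, so it only affects the block (or file) in which it is used.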

I've used `map` to convert an array of file names to an array of open file handles `@fh`, and then `until ( any { eof $_ } @fh ) { ... }` to read a line from each of the files until any of them reaches end of file

use strict;
use warnings 'all';
use autodie;

use List::Util qw/ any sum /;

my @minutes = qw/ 01 06 11 16 21 26 31 36 41 46 51 56 /;

my @files = map "C:/Perl64/output/sbc_cause_comp_intraday_2017-02-03_00_$_.txt", @minutes;

my $output_file = "C:/Perl64/output/denemecik.txt";

my ( $date, $hour ) = $files[0] =~ /(\d\d\d\d-\d\d-\d\d)_(\d\d)/;

my @fh = map {
    open my $fh, '<', $_;
    $fh;
} @files;

open my $out_fh, '>', $output_file;

until ( any { eof $_ } @fh ) {

    my ( $ne, $cc_type, $cc_count );

    my @data = map {
        chomp( my $line = <$_> );
        ( $ne, $cc_type, $cc_count ) = split /\t/, $line;
        $cc_count;
    } @fh;

    print $out_fh join( "\t", $date, $hour, $ne, $cc_type, @data, sum @data ), "\n";
}

output

2017-02-03  00  ACME A  0   2   2   2   2   2   2   2   2   2   2   2   2   24
2017-02-03  00  ACME A  1   3   3   3   3   3   3   3   3   3   3   3   3   36
2017-02-03  00  ACME A  2   5   5   5   5   5   5   5   5   5   5   5   5   60
2017-02-03  00  ACME A  3   7   7   7   7   7   7   7   7   7   7   7   7   84
2017-02-03  00  ACME A  4   11  11  11  11  11  11  11  11  11  11  11  11  132
2017-02-03  00  ACME A  5   13  13  13  13  13  13  13  13  13  13  13  13  156
2017-02-03  00  ACME A  6   17  17  17  17  17  17  17  17  17  17  17  17  204
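
One consequence of the `until ( any { eof $_ } @fh )` condition is that the loop stops at the end of the shortest input. The sketch below shows this with made-up in-memory data, where the second "file" is one line shorter than the first; it is only a demonstration, not part of the solution.

```perl
use strict;
use warnings;
use List::Util qw/ any sum /;

# Made-up demo data: the second "file" is one line shorter than the first.
my @contents = ( "ACME A\t0\t2\nACME A\t1\t3\n", "ACME A\t0\t5\n" );

# Open each string as an in-memory file handle.
my @fh = map {
    open my $fh, '<', \$_ or die "Cannot open in-memory handle: $!";
    $fh;
} @contents;

my @rows;
until ( any { eof $_ } @fh ) {
    # Read the next line from every handle and keep its third field.
    my @data = map {
        chomp( my $line = <$_> );
        ( split /\t/, $line )[2];
    } @fh;
    push @rows, join( "\t", @data, sum @data );
}

# Only one row is produced, because the shorter input ran out first.
print "$_\n" for @rows;
```

If the inputs are guaranteed to have the same number of lines, as in this question, every line is processed.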

I hope this helps

It is far from clear what you want, as your words describe something very different from what your code does

Here's what I think you want. It basically makes your code work. It calculates a total of the third column for each of the files and prints them out in a single line, just like your `print OUTPUT` statement

If this isn't what you want then you need to explain things better and give some clear examples

use strict;
use warnings 'all';
use autodie;

my @files = map "C:/Perl64/output/sbc_cause_comp_intraday_$_.txt", qw/
    2017-02-03_00_01
    2017-02-03_00_06
    2017-02-03_00_11
    2017-02-03_00_16
    2017-02-03_00_21
    2017-02-03_00_26
    2017-02-03_00_31
    2017-02-03_00_36
    2017-02-03_00_41
    2017-02-03_00_46
    2017-02-03_00_51
    2017-02-03_00_56
/;

my $output_file = "C:/Perl64/output/denemecik.txt";

my $date = 'sbc_cause_comp_intraday_2017-02-03_00_01';
my ( $date1, $hour ) = $date =~ /(\d\d\d\d-\d\d-\d\d)_(\d\d)/;

my @counts;
my ( $ne, $cc_type );

for my $file ( @files ) {

    push @counts, 0;

    open my $fh, '<', $file;

    while ( <$fh> ) {
        my @fields = split /\t/;
        ( $ne, $cc_type ) = @fields unless $ne;
        $counts[-1] += $fields[2];
    }
}

my $total;
$total += $_ for @counts;

{
    open my $fh, '>', $output_file;
    print $fh join( ' ', $date1, $hour, $ne, $cc_type, @counts, $total ), "\n";
}

@zdim: To be honest I think your guess was just as likely to be correct as mine! — Borodin, May 05 '17 at 20:21
I think now that: "_write it into a new file for each of those files_" meant that ("_information_") from "_each of those files_" goes into (one) "_a new file_" -- what agrees with their code. I think that you have the intent exactly right. — zdim, May 05 '17 at 20:32
@zdim: Ah I see. I think you may be right. I just ignored the words and wrote something that did what the code appeared to be trying to do, disregarding the confusion between arrays and scalars. — Borodin, May 05 '17 at 20:45
Hi Borodin, firstly I am sorry about my English :). In the output file, there should be 17 columns, including date, hour, equipment type like ACME A, and cc type. The date, hour, NE, and cc type values are the same for all input files. The only thing that changes is the cc_value (which is column 3 in every input file). I want to write these values as columns in the output file. For example, like this: — evenstar, May 05 '17 at 22:39
Date - Hour - NE - CC TYPE - values in the 1st file - values in the 2nd file - ... - values in the 12th file (3rd column) - total count of the values. Thank you for your kind help. — evenstar, May 05 '17 at 22:44
@evenstar: So you want one line of output for each line of input? Is the number of lines the same in every file? What exactly is in the last column? It sounds like you want a total of the preceding 12 values in that line; is that right? I won't be able to do this tonight as it's midnight right now, but I'll work on it in the morning. — Borodin, May 05 '17 at 22:55
@Borodin, yes, in every input file the number of columns and rows is the same. Basically they have 3 columns and a lot of rows. The last column in the output file is the sum of the values in columns 5-16 for each row. And my files are around 200MB in total, not each. Thanks — evenstar, May 06 '17 at 09:45
Thanks for your help. Can you please explain this part: `my @data = map { chomp( my $line = <$_> ); ( $ne, $cc_type, $cc_count ) = split /\t/, $line; $cc_count; } @fh;` — evenstar, May 06 '17 at 11:04
I didn't understand the use of the `map` function exactly; the rest is OK. — evenstar, May 06 '17 at 11:11
@evenstar: `map` converts one list of values into another by applying a block of code to each item of the original list. That second `map` converts `@fh` (the array of file handles) into `@data` (the value in the third column of the files). It does it by reading from the file handle into `$line`, `chomp`ing that line, splitting it into three fields `$ne`, `$cc_type`, and `$cc_count`, and returning the third field `$cc_count`. Using this to map from the list of 12 file handles results in a list of 12 values for the third column of the next line from each handle. — Borodin, May 06 '17 at 11:54
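
A small sketch of the same idea, using in-memory file handles so that it runs without any files on disk. The two data strings are made up for illustration and are not taken from the real inputs.

```perl
use strict;
use warnings;

# map applies the block to each item and collects the results.
my @doubled = map { $_ * 2 } ( 1, 2, 3 );
print "@doubled\n";

# Made-up demo data: two in-memory "files" with three tab-separated fields.
my @contents = ( "ACME A\t0\t2\nACME A\t1\t0\n", "ACME A\t0\t4\nACME A\t1\t1\n" );

# First map: turn each string into an open file handle.
my @fh = map {
    open my $fh, '<', \$_ or die "Cannot open in-memory handle: $!";
    $fh;
} @contents;

# Second map: read the next line from each handle and keep its third field.
my @data = map {
    chomp( my $line = <$_> );
    ( split /\t/, $line )[2];
} @fh;
print "@data\n";
```

Running the second `map` again would return the third field of the following line from each handle, which is what the `until` loop in the answer does on every pass.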
