
I need to split data found in one file into other files based upon their job id. The structure of the file, and the two helper files, are:

# data.txt - data file - node,timestamp,data
1,1516,25
2,1845,24
3,1637,26
4,1342,74
5,1426,63
6,1436,23
7,1732,64
1,1836,83
2,1277,12
3,2435,62
4,2433,47
5,2496,52
6,2142,69
7,2176,53

# job.txt - job timing - job,startts,endts
1234,1001,2000
5678,2001,2500

# node.txt - node to job map - job,node
1234,1
1234,2
1234,3
1234,4
1234,5
5678,3
5678,4
5678,5
5678,6
5678,7

In order to map a line in the data file to its appropriate new file, two transformations must take place. First, the data timestamp must be used to determine which jobs are running. Second, the list of running jobs must be checked to determine which owns the node the data references. This is my solution:

use strict;
use warnings;

my @timing = ( );
my %nodes = ( );
my %handles = ( );

### array of arrays containing job, start time, and end time
open JOB, "<job.txt" or die "can't open jobfile, $!";
while (<JOB>) {
    my @fields = split /,/;  #/ stop SO highlighter
    my @array = ($fields[0], $fields[1], $fields[2]);

    push @timing, \@array;
}
close JOB;

### map job -> array of nodes
open NID "<node.txt" or die "can't open nidfile";
while (<NID>) {
    my @fields = split /,/;  #/

    if (!exists $nodes{$fields[0]}) { $nodes{$fields[0]} = (); }
    push @{$nodes{$fields[0]}}, $fields[1];
}
close NID;

### split data
open DATA, "<data.txt" or die "Couldn't open file all.pow, $!";
while (<DATA>) {
    my @fields = split /,/;  #/

    my @jobs = grep {$fields[1] >= $_->[1] && $fields[1] <= $_->[2]} @timing;
    scalar @jobs > 0 or next;
    my $jid = (grep {!exists $nodes{$fields[0]}} @jobs)[0][0];

    ### create and memoize file handles
    if (!exists $handles{$jid}) {
        open my $temp, ">$jid.txt" or die "Can't open jfile $jid, $!";
        $handles{$jid} = $temp;
    }
    print {$handles{$jid}} "$fields[1],$fields[2]";
}
close DATA;

I would like to know if there are any ways to increase the speed/efficiency of the data file loop. This has to run over large amounts of data, and so it needs to be as efficient as possible. I would also appreciate any comments on more idiomatic approaches: this is my first perl script (the array of references to arrays took quite a while to figure out).

Azure Heights
  • Add [strict](http://p3rl.org/strict) and [warnings](http://p3rl.org/warnings). Then come back. – choroba Apr 19 '18 at 21:46
  • Might be more suitable for https://codereview.stackexchange.com – choroba Apr 19 '18 at 21:47
  • @choroba Both of those were done in my original script, but were overlooked when pasting into the question. I'll fix it. – Azure Heights Apr 19 '18 at 21:48
  • Good. Also, provide sample data rather than their description. See [mcve](https://stackoverflow.com/help/mcve). – choroba Apr 19 '18 at 21:50
  • Re: "_large amounts of data_" -- (1) how big is `jobfile`? (2) how big is `nidfile`? (3) how many filehandles are typically created? – zdim Apr 19 '18 at 22:08
  • @choroba Sample data added. Note that it is completely contrived. – Azure Heights Apr 19 '18 at 22:13
  • @zdim Should now execute. `jobfile` is 32 kilobytes, `nidfile` is 32 megabytes. I don't know exactly how many filehandles, 100s I would imagine. The machine I am using to process this data doesn't have a file handle limit. – Azure Heights Apr 19 '18 at 22:17
  • code reviews are off topic for stack overflow. – xaxxon Apr 19 '18 at 22:19
  • @xaxxon Noted, I'll post it on codereview as mentioned by choroba. Should I delete the one here? – Azure Heights Apr 19 '18 at 22:28
  • I think that your question about efficiency is perfectly fine here. (Perhaps people were thrown off by the question about idioms appearing front and center?) How does your code perform (runtime?) and what do you need instead? Adding that to the question (along with file sizes from your comment above) may emphasize its optimization aspect. – zdim Apr 20 '18 at 02:31
  • Btw, this is very good for the "_first perl script_" – zdim Apr 20 '18 at 02:37
  • Also -- how often does the code end up writing to files (how big are output files typically)? This is the most expensive part and if it writes for most lines it makes no sense "optimizing" other aspects. But if it writes infrequently that's different. – zdim Apr 20 '18 at 03:00
  • @zdim I've only run it on a small subset of the data to make sure it works before I gave it all the data to chew through. There's enough that if the time it took to run through the sample scales linearly, it would take just over 2 days to run through a day of data. While this is acceptable, any improvement would be welcomed. It writes a line for every line in the data (it's just restructuring the data to a format better for other tools). – Azure Heights Apr 22 '18 at 20:54
  • ouch ... two days? So how big are the real files? (I thought it was 32Kb+32Mb) I'll look into details and add code to speed things up. – zdim Apr 23 '18 at 02:30
  • @zdim those are the lookup table information files. The actual data of interest is orders of magnitude larger. I chose Perl because I had read somewhere it was good for text manipulation (also because I'd been looking for an excuse to learn it, but that's another story) – Azure Heights Apr 24 '18 at 01:39
  • OK, that helps. Still, this should go much, much quicker. I am asking about all this because I'd like to understand the statement before "optimizing." Is all that data in one file (well, two), ready for processing? – zdim Apr 24 '18 at 03:23
  • Btw, Perl indeed excels at text manipulation of any kind. In my experience it is excellent for much of other kind of work as well. I draw the line when a compiled language is needed (mostly for speed), for very specific problem domains, or for a _really_ complex job. – zdim Apr 24 '18 at 03:27
  • @zdim no worries. Yes, all data in the files are in csv format, including the main file. The script is splitting the individual data in the main file to separate files based on job id, for analysis purposes. It needs the information in the lookup tables to move it correctly. – Azure Heights Apr 24 '18 at 04:00

1 Answer


As for efficiency: if this code writes to files often (and needs to open many filehandles on the fly), I don't see a way to speed it up significantly. Allowing for pre- or post-processing may help, but that depends on the wider context.
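
As a rough sketch of the kind of pre-processing meant here (an illustration, not a requirement of the question): since the job and node files are small, one could build a per-node list of job intervals up front, so the data loop only checks the few jobs that ever ran on a given node instead of scanning all of @timing for every line. The hash name %node_jobs is made up, and this assumes the node values are chomped when node.txt is read so they compare equal to the node field of the data lines.

    # node => [ [job, start, end], ... ]
    my %node_jobs;
    for my $t (@timing) {
        push @{ $node_jobs{$_} }, $t for @{ $nodes{ $t->[0] } };
    }

    # then, inside the data loop:
    my ($t) = grep { $fields[1] >= $_->[1] && $fields[1] <= $_->[2] }
              @{ $node_jobs{ $fields[0] } // [] };
    next if not $t;
    my $jid = $t->[0];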

Here is a quick rundown of some coding points, in order of appearance.

  • There is no use in "initializing" arrays and hashes to an empty list; just declare them
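
    For example, the three containers at the top of the question's script need only be declared:

      my @timing;
      my %nodes;
      my %handles;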

  • Always use lexical filehandles and three-argument open (like you do for writing). So

      open my $job_fh, '<', ...
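
    Applied to the question's job file, a complete call might look like this (the error message is just an example):

      open my $job_fh, '<', 'job.txt' or die "Can't open job.txt: $!";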
    
  • There is no need to first create a key in order to assign to it. The first line below is unneeded

      #if (!exists $nodes{$fields[0]}) { $nodes{$fields[0]} = (); }  # NOT needed
      push @{$nodes{$fields[0]}}, $fields[1];
    

    This works courtesy of autovivification: when an undefined value is dereferenced in an lvalue context (one in which it can be modified), it is created. This is a subtle feature. Close to your use is this post; also see this Effective Perler article, and this post with its links.
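
    A minimal illustration with made-up variables:

      my %h;
      push @{ $h{list} }, 1, 2;   # $h{list} springs into existence as an array reference
      # %h now holds ( list => [1, 2] )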

  • Use array, or list, slices. Instead of

      my @array = ($fields[0], $fields[1], $fields[2]);
    

    one can do

      my @array = @fields[0,1,2];   # or @fields[0..2] in this case
    

    Then one can also do

      push @timing, [ @fields[0..2] ];
    

    or

      push @timing, [ (split /,/)[0..2] ];  # no need for @fields now
    

    which are more efficient since intermediate arrays aren't created. But these last two are less flexible, limiting possibilities for intermediate processing or checks. If this is all you do then

      my @timing = map { [ (split /,/)[0..2] ] } <$job_fh>;
    

    also goes, but it is even less flexible (and reads all lines at once)

  • Arrays in scalar context are evaluated to return the number of elements. So

      scalar @jobs > 0 or next;  # scalar unneeded
    

    does not need scalar and you can write

      @jobs > 0 or next;
    

    which reduces to

      @jobs or next;
    

    but which I'd rather write as

      next if not @jobs;   # or:  next unless @jobs;
    

    using a statement modifier ("postfix syntax").


In the last example above, the <> operator is in the list context imposed by map, so it returns all the lines at once.
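
As a small illustration of that context difference (with a hypothetical filehandle $fh):

    my $one_line  = <$fh>;   # scalar context: reads a single line
    my @all_lines = <$fh>;   # list context: reads all remaining lines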

zdim
  • Is the difference between the file handles I use and the lexical ones the same as the difference between global and local variables? – Azure Heights Apr 19 '18 at 22:32
  • @AzureHeights That plus more, since it's about filehandles. In short, you practically never need typeglobs – zdim Apr 19 '18 at 22:34
  • @AzureHeights Added links to docs (one for typeglobs vs lexical) – zdim Apr 19 '18 at 22:53