0

I am having trouble getting my Perl script to work. The issue might be related to reading the Extract file line by line within the while loop. Any help would be appreciated. There are two files:

A Bad file that contains a list of bad IDs (hundreds of IDs):

2
3

An Extract file that contains delimited data with the ID in field 1 (millions of rows):

1|data|data|data
2|data|data|data
2|data|data|data
2|data|data|data
3|data|data|data
4|data|data|data
5|data|data|data

I am trying to remove all the rows from the large extract file whose ID matches one of the bad IDs. There can be multiple rows with the same matching ID. The extract is sorted.

#use strict;
#use warnings;

$SourceFile = $ARGV[0];
$ToRemove = $ARGV[1];
$FieldNum = $ARGV[2];
$NewFile = $ARGV[3];
$LargeRecords = $ARGV[4];

open(INFILE, $SourceFile) or die "Can't open source file: $SourceFile \n";
open(REMOVE, $ToRemove) or die "Can't open toRemove file: $ToRemove \n";
open(OutGood, "> $NewFile") or die "Can't open good output file \n";
open(OutLarge, "> $LargeRecords") or die "Can't open Large Records output file \n";


#Read in the list of bad IDs into array
@array = <REMOVE>;

#Loop through each bad record 
foreach (@array)
{
$badID = $_;

#read the extract line by line 
while(<INFILE>)
{
    #take the line and split it into 
    @fields = split /\|/, $_;
    my $extractID = $fields[$FieldNum];

    #print "Here's what we got: $badID and $extractID\n";

    while($extractID == $badID) 
    {
        #Write out bad large records
        print OutLarge join '|', @fields;

        #Get the next line in the extract file
        @fields = split /\|/, <INFILE>;
        my $extractID = $fields[$FieldNum];

        $found = 1; #true

        #print " We got a match!!";

        #remove item after it has been found 
        my $input_remove = $badID;
        @array = grep {!/$input_remove/} @array;


    }

print OutGood join '|', @fields;

}

}
  • I think the issue I am having is with reading the next line within the while loop. I find the match and get stuck in the loop. – user1090708 Jan 14 '14 at 15:26

3 Answers

2

Try this:

$ perl -F'\|' -nae 'BEGIN {while(<>){chomp; $bad{$_}++;last if eof;}} print unless $bad{$F[0]};' bad good
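For readers who find the one-liner dense, here is a minimal sketch of the same idea spelled out as a script: load the bad IDs into a hash, then print every extract row whose first field is not in that hash. The sub name `filter_extract` and the command-line handling are assumptions made for illustration, not part of the original command.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: copies rows from $in_fh to $out_fh, skipping any row
# whose first pipe-delimited field appears in the bad-ID list read from $bad_fh.
sub filter_extract {
    my ($bad_fh, $in_fh, $out_fh) = @_;

    # Build the lookup table of bad IDs, one ID per line.
    my %bad;
    while (my $id = <$bad_fh>) {
        chomp $id;
        $bad{$id}++;
    }

    # Stream the extract; only the current line is held in memory.
    while (my $line = <$in_fh>) {
        my ($id) = split /\|/, $line, 2;
        print {$out_fh} $line unless $bad{$id};
    }
    return;
}

# Assumed usage: perl filter.pl bad extract > cleaned
if (@ARGV == 2) {
    open my $bad_fh, '<', $ARGV[0] or die "Cannot open '$ARGV[0]': $!";
    open my $in_fh,  '<', $ARGV[1] or die "Cannot open '$ARGV[1]': $!";
    filter_extract($bad_fh, $in_fh, \*STDOUT);
}
```

Because the lookup is a hash, each row costs one constant-time check no matter how many bad IDs there are, and the extract does not need to be sorted.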
Austin Hastings
1

First, you are lucky: the number of bad IDs is small. That means you can read the list of bad IDs once and store them in a hash without running into any memory problems. Once they are in a hash, you read the big data file line by line and skip output for any row with a bad ID.

#!/usr/bin/env perl

use strict;
use warnings;

# hardwired for convenience
my $bad_id_file = 'bad.txt';
my $data_file = 'data.txt';

my $bad_ids = read_bad_ids($bad_id_file);

remove_data_with_bad_ids($data_file, $bad_ids);

sub remove_data_with_bad_ids {
    my $file = shift;
    my $bad = shift;

    open my $in, '<', $file
        or die "Cannot open '$file': $!";
    while (my $line = <$in>) {
        if (my ($id) = extract_id(\$line)) {
            exists $bad->{ $id } or print $line;
        }
    }

    close $in
        or die "Cannot close '$file': $!";
    return;
}

sub read_bad_ids {
    my $file = shift;
    open my $in, '<', $file
        or die "Cannot open '$file': $!";

    my %bad;
    while (my $line = <$in>) {
        if (my ($id) = extract_id(\$line)) {
            $bad{ $id } = undef;
        }
    }
    close $in
        or die "Cannot close '$file': $!";
    return \%bad;
}

sub extract_id {
    my $string_ref = shift;
    if (my ($id) = ($$string_ref =~ m{\A ([0-9]+) }x)) {
        return $id;
    }
    return;
}
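The question's script also tries to keep the removed rows in a second file (`OutLarge`) and takes the ID field position as an argument. A minimal sketch of that variation follows, assuming a hash reference like the one `read_bad_ids` returns. The sub name `partition_rows` and its parameter order are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the answer above: sends each row to
# $removed_fh if the ID in field $field_num (0-based) is a key of %$bad,
# and to $good_fh otherwise.
sub partition_rows {
    my ($in_fh, $bad, $field_num, $good_fh, $removed_fh) = @_;

    while (my $line = <$in_fh>) {
        my $id = (split /\|/, $line)[$field_num];
        $id = '' unless defined $id;   # row has fewer fields than expected
        chomp $id;                     # the last field still carries the newline
        my $out = exists $bad->{$id} ? $removed_fh : $good_fh;
        print {$out} $line;
    }
    return;
}
```

Every row is written to exactly one of the two handles, so the two output files together contain the whole extract.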
Sinan Ünür
  • Thank you, this works. However, could you please explain the shift function and how it applies here? I just read this: http://stackoverflow.com/questions/296964/what-does-shift-do-in-perl but would like a bit of clarification. – user1090708 Jan 14 '14 at 17:21
  • 1
    [shift](http://perldoc.perl.org/functions/shift.html) and [perlsub](http://perldoc.perl.org/perlsub.html): *Any arguments passed in show up in the array `@_`.* – Sinan Ünür Jan 14 '14 at 21:02
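To make the comment above concrete, here is a small sketch of `shift` inside a subroutine. Called with no argument, it removes and returns the first element of `@_`, so two calls pick up the first and second arguments in turn. The sub `describe` is a hypothetical example, not code from the answer.

```perl
use strict;
use warnings;

# Hypothetical sub for illustration: inside a subroutine, shift with no
# argument removes and returns the first element of @_.
sub describe {
    my $file = shift;   # first argument
    my $bad  = shift;   # second argument (a hash reference)
    return "$file has " . scalar(keys %$bad) . " bad IDs";
}

print describe('data.txt', { 2 => undef, 3 => undef }), "\n";
# prints "data.txt has 2 bad IDs"
```

Writing `my ($file, $bad) = @_;` would assign the same two values in a single statement, without removing them from `@_`.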
1

I'd use a hash as follows:

use warnings;
use strict;

my @bad = qw(2 3);

my %bad;

$bad{$_} = 1 foreach @bad;

my @file = qw (1|data|data|data 2|data|data|data 2|data|data|data 2|data|data|data 3|data|data|data 4|data|data|data 5|data|data|data);

# Filter row by row so repeated IDs and the original order are preserved
# (storing rows in a hash keyed by ID would keep only the last row per ID)
foreach my $row (@file){
    my ($id) = split(/\|/, $row);
    print "$row\n" unless exists $bad{$id};
}

Which gives:    

1|data|data|data
4|data|data|data
5|data|data|data
fugu