-2

If any of you could modify the code so that the sequence names in file 1 are searched within file 2, and if there is a match, the lines in file 1 and its next line are copied to an outfile. right now the code only copies the matched titles but not its next line which is the sequence to the outfile. thanks

for example:

FILE 1 :

SEQUENCE 1 NAME

SEQUENCE 2 NAME

SEQUENCE 3 NAME

FILE 2:

SEQUENCE 1 NAME

AGTCAGTCAGTCAGTCAGTC

SEQUENCE 2 NAME

AAGGGTTTTCCCCCCAAAAA

SEQUENCE 3 NAME

GGGGTTTTTTTTTTAAAAAC

SEQUENCE 4 NAME

AAGTCCCCCCCCCCAAGGTT

etc.

OUTFILE:

SEQUENCE 1 NAME

AGTCAGTCAGTCAGTCAGTC

SEQUENCE 2 NAME

AAGGGTTTTCCCCCCAAAAA

SEQUENCE 3 NAME

GGGGTTTTTTTTTTAAAAAC

code: 

use strict;
use warnings;

my $f1 = 'FILE1.fasta';
open FILE1, "$f1" or die "Could not open file \n";
my $f2= 'FILE2.fasta';
open FILE2, "$f2" or die "Could not open file \n";

my $outfile = $ARGV[1];

my @outlines;
my $n=0;
foreach (<FILE1>) {
    my $y = 0;
    my $outer_text = $_ ;


    seek(FILE2,0,0);
    foreach (<FILE2>) {
        my $inner_text = $_;

        if($outer_text eq $inner_text) {    

            print "$outer_text\n";
            push(@outlines, $outer_text);
            $n++;

        }
    }
}
open (OUTFILE, "sequences.fasta") or die "Cannot open $outfile \ +n";
print OUTFILE @outlines;
close OUTFILE;
  • 1
    I don't know what "file 1 is searched within file 2" is supposed to mean. –  Aug 12 '13 at 16:54
  • so taking the sequence names in file 1, search to see if they are in file 2, if they are, print the names and the next lines to outfile. – user2675738 Aug 12 '13 at 16:55
  • 1
    Are these files large? If so, your current code is doing `m*n` comparisons (where `m` and `n` are the number of lines in file 1 and file 2, respectively). –  Aug 12 '13 at 16:57
  • the files can get very large at times, i'm not an expert perl coder. in this case i would like to know how to write the next line following the sequence name to an output file. – user2675738 Aug 12 '13 at 16:59
  • Well, each call to `` grabs a line from file 2. By the way, [two argument opens are bad](http://stackoverflow.com/questions/1479741/why-is-three-argument-open-calls-with-autovivified-filehandles-a-perl-best-pract). –  Aug 12 '13 at 17:00
  • There is no need to print the sequence name line from `file1` plus the line from `file2` if there is a match, obviously `file2` has the sequence name line and your sequence in it, so if it matches just print the results of `file2`? – hwnd Aug 12 '13 at 17:14
  • And how exactly is your data? What's the format of `SEQUENCE 1 NAME`? – hwnd Aug 12 '13 at 17:18
  • @hwnd yes that also makes sense. – user2675738 Aug 12 '13 at 17:23
  • @hwnd hi, my file is in fast format : >sequence name 1 followed by the sequence in the next line. – user2675738 Aug 12 '13 at 17:24
  • the problem is that i tried but couldn't figure out how to simply add a line or a few lines of code to output the line after the sequence name. the code now only outputs the matching sequence names without giving the sequences followed. – user2675738 Aug 12 '13 at 17:26
  • So your line starts with a `>`? And is the name exactly SEQUENCE or something different on every line? – hwnd Aug 12 '13 at 17:34
  • @hwnd no so what i wrote here are just some examples to make things simpler. it is actually >@M01438:4:000000000-A407V followed by sequence, but every name and sequence are different. – user2675738 Aug 12 '13 at 17:35

2 Answers2

0

For very large FILE1, %seen hash could be tied to some of DBM storage,

use strict;
use warnings;

my $f1 = 'FILE1.fasta';
open FILE1, "<", $f1 or die $!;
my $f2 = 'FILE2.fasta';
open FILE2, "<", $f2 or die $!;

# my $outfile = $ARGV[1];
open OUTFILE, ">", "sequences.fasta" or die $!;

my %seen;
while (<FILE1>) {
    $seen{$_} = 1;
}

while (<FILE2>) {
    my $next_line = <FILE2>;

    if ($seen{$_}) {    
        print OUTFILE $_, $next_line;
    }
}
close OUTFILE;
mpapec
  • 50,217
  • 8
  • 67
  • 127
  • @user2675738 you can accept answer if it works for you – mpapec Aug 12 '13 at 17:44
  • It's possibly a good idea to use `$seen{$1} = 1 if /^>(\S+)/;` skipping the `>` and storing the sequence since they are all different. Just IMO. – hwnd Aug 12 '13 at 17:55
0

I would put the contents of file 2 into a hash, then check if each record from file 1 was in the hash:

#!perl
use strict;
use warnings;

my $f2= 'FILE2.fasta';
open FILE2, "$f2" or die "Could not open file \n";

my $k;
my $v;
my %hash;

while (defined($k = <FILE2>)) {
        chomp $k;
        $v = <FILE2>;
        $hash{$k} = $v;
}

my $f1 = 'FILE1.fasta';
open FILE1, "$f1" or die "Could not open file \n";

open (OUTFILE, ">sequences.fasta") or die "Cannot open seqeneces.fasta\n";
while (<FILE1>) {
        chomp;
        if (exists($hash{$_})) {
                print OUTFILE "$_\n";
                print OUTFILE "$hash{$_}\n";
       }
}

close OUTFILE;
igotmumps
  • 549
  • 2
  • 7