I have a text file with taxonomic assignments like:

#name_file

Bacteria;WS3;PRR-12;SSS58A 0.0 0.12 0.6

Bacteria;WS3;PRR-12;Sediment-1 0.5 0.1 0.3

Bacteria;Terrabacteria_group;Firmicutes;Bacilli; unclassified_Bacillales;Bacillaceae;Vulcanibacillu 0.2 0.2 0.6

Bacteria;Terrabacteria_group;Firmicutes;Bacilli;Bacillales;Bacillaceae;Vulcanibacillu 0.2 0.2 0.6

Bacteria;Terrabacteria_group;Firmicutes;Bacilli;Bacillales;Bacillales_incertae_sedis;Bacillales_Family_X 0.1 0.3 0.5

Bacteria;Terrabacteria_group;Firmicutes;Bacilli;Bacillales;Bacillales_incertae_sedis;Bacillales_Family_X._Incertae_Sedis;Thermicanus 0.4 0.13 0.9

Bacteria;Nitrospirae;Nitrospira;Nitrospirales;Thermodesulfovibrionaceae 0.1 0.2 0.6

Bacteria;Nitrospirae;Nitrospira;Nitrospirales;Thermodesulfovibrionaceae;BD2-6 0.0 0.0 0.6

Bacteria;PVC_group;Lentisphaerae;Lentisphaeria;Lentisphaerales 0.7 0.2 0.1

For each line, I want to extract the first word that ends in "ales", plus the second one only if it ends in "ales_incertae_sedis". The output should look like:

Bacillales
Bacillales;Bacillales_incertae_sedis 
Bacillales;Bacillales_incertae_sedis
Nitrospirales
Nitrospirales
Lentisphaerales

but it should not include a third one, as in:

Bacillales;Bacillales_incertae_sedis;Bacillales_Family

I have tried:

use strict;
use warnings;
use Getopt::Long;

my $infile;
GetOptions (
    'i=s'       =>\$infile,
);


open INFILE, '<', $infile or die "can't open file $infile";
open OUTFILE, '>', "results.txt" or die "can't open results.txt";

while ( <INFILE>) {
    my $line = $_;
    chomp($line);
    if ($line=~ m/^#/) {      # skip comment lines
        next;
    }
    elsif ($line=~ m/^$/){    # skip empty lines
        next;
    }

    elsif($line){
        # the taxonomy string is the first tab-separated column
        my ($taxon, $val1, $val2, $val3) = split(/\t/, $line);
    #here is the problem 
        my (@orden) = ($taxon=~ m/(\w*ales)[\;]?/g);
        foreach (@orden){
           if ($_=~m/^$/){
               next;
           }
           elsif ($_=~ m/^unclassified/){
               next;
           }
           else {
               print OUTFILE "$_\n";
           }
       }
   }
}
close INFILE;            
close OUTFILE;
exit;

My problem is this line:

my (@orden) = ($taxon=~ m/(\w*ales)[\;]?/g);

I've tried to change it so that it matches both options:

my (@orden) = ($taxon=~ m/(\w*ales)[\;]?(;\w*ales_incertae_sedis)/g);
my (@orden) = ($taxon=~ m/(\w*ales[;\w*ales_incertae_sedis]?)[\;]?/g);

but neither of them works.

Thanks so much.

abraham
  • I suggest you split `$taxon` on `;` then loop through the items, checking each for `/ales/`, also having a counter `$i` that increases for each match. If `$i == 2` the item must also match `/ales_incertae_sedis$/ `. If `$i > 2` nothing is printed. – Håkon Hægland Aug 05 '16 at 20:23
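
The split-and-count approach suggested in the comment above could be sketched roughly as follows. This is only an illustration, not code from the original posters: the helper name `extract_order` is hypothetical, it assumes the taxonomy string has already been separated from the numeric columns, and it assumes that fields starting with "unclassified" should be ignored, as in the question's own code.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the "ales" part of a semicolon-separated
# taxonomy string, or undef if no field qualifies.
sub extract_order {
    my ($taxon) = @_;
    my @kept;
    my $i = 0;    # counts the fields containing "ales" seen so far
    for my $item (split /;/, $taxon) {
        $item =~ s/^\s+|\s+$//g;                 # tolerate stray spaces around a field
        next unless $item =~ /ales/;
        next if $item =~ /^unclassified/;        # assumption: skipped, as in the question
        $i++;
        if ($i == 1) {
            push @kept, $item;                   # the first match is always kept
        }
        elsif ($i == 2) {
            # the second match is kept only if it ends in ales_incertae_sedis
            push @kept, $item if $item =~ /ales_incertae_sedis$/;
        }
        else {
            last;                                # a third match is never printed
        }
    }
    return @kept ? join(';', @kept) : undef;
}

# Taxonomy strings taken from the sample data in the question.
my @samples = (
    'Bacteria;WS3;PRR-12;SSS58A',
    'Bacteria;Terrabacteria_group;Firmicutes;Bacilli;Bacillales;Bacillaceae;Vulcanibacillu',
    'Bacteria;Terrabacteria_group;Firmicutes;Bacilli;Bacillales;Bacillales_incertae_sedis;Bacillales_Family_X',
    'Bacteria;PVC_group;Lentisphaerae;Lentisphaeria;Lentisphaerales',
);

for my $taxon (@samples) {
    my $order = extract_order($taxon);
    print "$order\n" if defined $order;
}
# prints:
# Bacillales
# Bacillales;Bacillales_incertae_sedis
# Lentisphaerales
```

Splitting first keeps each condition in plain Perl rather than in one large regex, which makes the "second match only if it ends in ales_incertae_sedis" rule easy to read and change.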

1 Answer

Try this:

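use warnings;
use strict;
my $m;

# INFILE is assumed to be opened already, as in the question
while ( <INFILE>) 
{
    # first alternative: order plus the next field, when a "family" field follows
    # second alternative: the order alone
    if($_=~/(?:([a-z]+ales;[^;]+);.+?family|(\w+ales)\b)/i )
    {
            $m = $1 || $2;
            # skip entries such as unclassified_Bacillales
            print "$m\n" if($m!~/^unc/);

    }   
}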

In the code above I used a non-capturing group, (?:...).

For more about non-capturing groups, see this answer.
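
As a small illustration of the difference, the sketch below matches the same taxonomy string twice, once with a capturing separator group and once with a non-capturing one. The variable names are made up for this example and the string is taken from the sample data in the question.

```perl
use strict;
use warnings;

my $taxon = 'Bacteria;Nitrospirae;Nitrospira;Nitrospirales;Thermodesulfovibrionaceae';

# Capturing: the separator group fills $1, so the order ends up in $2.
my @with_capture = $taxon =~ /(^|;)(\w+ales)(?:;|$)/;

# Non-capturing: (?:^|;) only groups the alternation, so the order is $1.
my @without_capture = $taxon =~ /(?:^|;)(\w+ales)(?:;|$)/;

print scalar(@with_capture), " captures: @with_capture\n";
print scalar(@without_capture), " capture: @without_capture\n";
# prints:
# 2 captures: ; Nitrospirales
# 1 capture: Nitrospirales
```

Using (?:...) for groups that only exist to hold an alternation keeps the numbering of $1, $2 and so on tied to the parts you actually want to extract.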

mkHun
  • Why did you use `(?{ })` here? You could simply do `my $m = $1 || $2;` in the next line. – melpomene Aug 06 '16 at 09:14
  • @melpomene I simply wanted to use a special feature. Is there any problem with using it? – mkHun Aug 06 '16 at 09:45
  • `(?{ })` is a pretty obscure feature. It doesn't buy you anything here; if anything, it makes the code harder to read. If you're going to use it, why move just the assignment to `$m` into the regex? You could've put the whole `print` statement in there. Up until perl 5.20, it had this note in the documentation: "**WARNING**: *This extended regular expression feature is considered experimental, and may be changed without notice.*" And indeed it had serious bugs related to parsing/scoping up until perl 5.18, where it was rewritten. – melpomene Aug 06 '16 at 11:15
  • @melpomene Thank you for your comment. I didn't know this was experimental. Post edited. Thanks again :) – mkHun Aug 06 '16 at 11:29
  • Thanks so much, all of you. I will try all your advice! Thanks – abraham Aug 09 '16 at 17:29