I have a text file with taxonomic assignments like this:
#name_file
Bacteria;WS3;PRR-12;SSS58A 0.0 0.12 0.6
Bacteria;WS3;PRR-12;Sediment-1 0.5 0.1 0.3
Bacteria;Terrabacteria_group;Firmicutes;Bacilli; unclassified_Bacillales;Bacillaceae;Vulcanibacillu 0.2 0.2 0.6
Bacteria;Terrabacteria_group;Firmicutes;Bacilli;Bacillales;Bacillaceae;Vulcanibacillu 0.2 0.2 0.6
Bacteria;Terrabacteria_group;Firmicutes;Bacilli;Bacillales;Bacillales_incertae_sedis;Bacillales_Family_X 0.1 0.3 0.5
Bacteria;Terrabacteria_group;Firmicutes;Bacilli;Bacillales;Bacillales_incertae_sedis;Bacillales_Family_X._Incertae_Sedis;Thermicanus 0.4 0.13 0.9
Bacteria;Nitrospirae;Nitrospira;Nitrospirales;Thermodesulfovibrionaceae 0.1 0.2 0.6
Bacteria;Nitrospirae;Nitrospira;Nitrospirales;Thermodesulfovibrionaceae;BD2-6 0.0 0.0 0.6
Bacteria;PVC_group;Lentisphaerae;Lentisphaeria;Lentisphaerales 0.7 0.2 0.1
From each line I want to extract the first word that ends in "ales", plus the second one only if it ends in "ales_incertae_sedis". The printed output should look like:
Bacillales
Bacillales;Bacillales_incertae_sedis
Bacillales;Bacillales_incertae_sedis
Nitrospirales
Nitrospirales
Lentisphaerales
but it should not include the third match, as in:
Bacillales;Bacillales_incertae_sedis;Bacillales_Family
I have tried:
use strict;
use warnings;
use Getopt::Long;

my $infile;
my $results = 'results';    # output file prefix (value assumed here)
GetOptions(
    'i=s' => \$infile,
);

open my $in,  '<', $infile        or die "can't open file $infile: $!";
open my $out, '>', "$results.txt" or die "can't open $results.txt: $!";

while (my $line = <$in>) {
    chomp $line;
    next if $line =~ /^#/;    # skip header/comment lines
    next if $line =~ /^$/;    # skip empty lines

    # the first tab-separated field is the taxonomy, the rest are values
    my ($taxon, $val1, $val2, $val3) = split /\t/, $line;

    #here is the problem
    my (@orden) = ($taxon=~ m/(\w*ales)[\;]?/g);
    foreach (@orden) {
        if ($_ =~ m/^$/) {
            next;
        }
        elsif ($_ =~ m/^unclassified/) {
            next;
        }
        else {
            print $out "$_\n";
        }
    }
}

close $in;
close $out;
exit;
My problem is this line:
my (@orden) = ($taxon=~ m/(\w*ales)[\;]?/g);
I've tried to change it so that it accepts both options:
my (@orden) = ($taxon=~ m/(\w*ales)[\;]?(;\w*ales_incertae_sedis)/g);
my (@orden) = ($taxon=~ m/(\w*ales[;\w*ales_incertae_sedis]?)[\;]?/g);
but it doesn't work.
Thanks so much.
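For reference, the rule described above (first rank ending in "ales", plus the next rank only when it ends in "ales_incertae_sedis", skipping "unclassified_" ranks) could be sketched roughly as follows. This is only a minimal illustration: it splits the taxonomy on ";" instead of using a single regex, and `extract_order` is a hypothetical helper name, not something from the script above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): given the
# semicolon-separated taxonomy string, return the first rank ending in
# "ales", plus the next rank if it ends in "ales_incertae_sedis".
# Returns nothing (undef in scalar context) when no such rank exists.
sub extract_order {
    my ($taxon) = @_;
    my @ranks = split /;\s*/, $taxon;
    for my $i (0 .. $#ranks) {
        next unless $ranks[$i] =~ /ales$/;          # whole rank must end in "ales"
        next if     $ranks[$i] =~ /^unclassified/;  # skip unclassified_* ranks
        my $order = $ranks[$i];
        if ($i < $#ranks && $ranks[$i + 1] =~ /ales_incertae_sedis$/) {
            $order .= ";$ranks[$i + 1]";
        }
        return $order;
    }
    return;
}

# Small demonstration on two taxonomy strings from the sample data.
for my $taxon (
    'Bacteria;Terrabacteria_group;Firmicutes;Bacilli;Bacillales;Bacillales_incertae_sedis;Bacillales_Family_X',
    'Bacteria;Nitrospirae;Nitrospira;Nitrospirales;Thermodesulfovibrionaceae',
) {
    my $order = extract_order($taxon);
    print "$order\n" if defined $order;
}
```

Anchoring the match at the end of each rank (`/ales$/`) is what keeps "Bacillales_Family_X" from being picked up, since `\w*ales` on its own also matches the "Bacillales" prefix inside longer names.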