I recently started using PowerShell on Windows 7 to write pipeline-like scripts for the program mothur. Previously I used Bash scripting on Ubuntu for this. Everything works well now except for one task:
I would like to convert a FASTA file of the form:
filename.fasta:
>HXXC990
AGTTCAAGGTCTCT
>HXXC991
GGGTTTCAAATCTC
>HXXC992
GGGTCTCTCCTATA
into a tab-delimited file that looks like this:
output.file:
HXXC990 filename
HXXC991 filename
HXXC992 filename
It is important that the first column of the output file contains the sequence names without the ">" sign, and that the second column, separated by a tab, contains the name of the original file without its suffix ("filename" for filename.fasta). I already know how to use gci to read the base name of the file and Select-String to output all the lines beginning with ">". What remains is formatting the result into two columns and repeating the file name on every line of the second column.
So far I've tried:
Select-String '>' .\filename.fasta | % {$_.Line} | set-content output.txt
This produces a file containing only the lines that contain the ">" sign. Afterwards I just replaced the ">" signs. I got the file name with:
$base1 = gci filename.fasta | % {$_.BaseName}
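To make the goal concrete, here is a rough sketch of how the two pieces might fit together. Get-FastaNameTable is a name made up for illustration, and the sketch assumes that every header line starts with ">" and that the sequence name is everything after it; it has not been checked against other FASTA variants:

```powershell
# Hypothetical helper (name invented for this sketch): emits one
# "name<TAB>basename" string per FASTA header line found in $Path.
function Get-FastaNameTable {
    param([string]$Path)

    # BaseName is the file name without its extension, e.g. "filename"
    $base = (Get-Item $Path).BaseName

    # Keep only lines that start with ">", strip the leading ">",
    # and append the base name as a second, tab-separated column.
    Select-String -Path $Path -Pattern '^>' |
        ForEach-Object { "{0}`t{1}" -f $_.Line.TrimStart('>'), $base }
}
```

It would be called as: Get-FastaNameTable .\filename.fasta | Set-Content output.txt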