I have a folder full of FASTA files in the following format, where each line beginning with > is the read name of a DNA sequence and the next line is the sequence itself. This pattern repeats for the entire file:
> 887_ENCFF899MTI.fastq.gz_seq1
GGCCCGCCTCCCGTCGGCCGGTGCGAGCGGCTCCGCGA
> 55_ENCFF899MTI.fastq.gz_seq2
GGGGGGGGCGTCTCGCGCAAACGTCCATAAC
> ...
...
In the read names, the leading number (e.g. 887) is the index of the query sequence I used to find this read. The query sequences are stored in a separate file (e.g. SequenceNames.txt), which can be assumed to have one name per line in this format:
SequenceA
SequenceB
...
I want to replace only the number between > and the first _ (avoiding incidental matches with numbers in the filename part of the read name) with the entry at that index in the SequenceNames file. For example, I would want
> 1_ENCFF899MTI.fastq.gz_seq1
ACTATC
> 2_ENCFF899MTI.fastq.gz_seq1
to become
> SequenceA_ENCFF899MTI.fastq.gz_seq1
> SequenceB_ENCFF899MTI.fastq.gz_seq1
I can make replacements like this in general, but I'm unsure how to restrict the index lookup to the regex match between > and _ without performing a file-wide dictionary replacement of these numbers. I'm also struggling with awk array indexing to get something like
gawk '{print gensub(/^> ([0-9]*)_/,array[pattern],"\\1")}'
to produce what I'm looking for.
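For reference, here is a minimal sketch of the behaviour I'm after, written in portable awk rather than with gensub. It assumes the index is the 1-based line number in SequenceNames.txt (as in the example above) and that the names file is not empty. The function name rename_reads and the array name are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: rename_reads NAMES_FILE FASTA_FILE
# Prints the FASTA file with each leading header number replaced by the
# name found on that (1-based) line of NAMES_FILE.
rename_reads() {
  awk '
    # First file: store each name under its line number.
    # (Assumes the names file is non-empty, otherwise NR == FNR
    # would also be true for the FASTA file.)
    NR == FNR { name[FNR] = $0; next }

    # Header lines only: ">" + optional spaces + digits + "_" at line start.
    match($0, /^> *[0-9]+_/) {
      head = substr($0, 1, RLENGTH)   # e.g. "> 887_"
      idx = head
      gsub(/[^0-9]/, "", idx)         # keep only the digits
      idx += 0                        # force numeric, drops leading zeros
      if (idx in name) {
        sub(/[0-9]+_$/, "", head)     # head is now just "> "
        $0 = head name[idx] "_" substr($0, RLENGTH + 1)
      }
      # Indices with no matching name are left unchanged.
    }

    { print }
  ' "$1" "$2"
}
```

It would be called as rename_reads SequenceNames.txt reads.fasta > reads.renamed.fasta, where reads.fasta is a placeholder file name. Because the pattern is anchored at the start of the line, numbers later in the read name (such as the 899 in ENCFF899MTI) and the sequence lines themselves are never touched.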