Replacing numerical values in a FASTA file with their index in a different file. (Bash preferred)

Question

I have a folder full of FASTA files in the following format, where each line beginning with > is the read name of a DNA sequence and the next line is the sequence itself. This pattern repeats for the entire file:

> 887_ENCFF899MTI.fastq.gz_seq1
GGCCCGCCTCCCGTCGGCCGGTGCGAGCGGCTCCGCGA
> 55_ENCFF899MTI.fastq.gz_seq2
GGGGGGGGCGTCTCGCGCAAACGTCCATAAC
> ...
...

In the read names, the leading number (e.g. 887) is the index of the query sequence I used to find this read. The query names are stored in a different file (e.g. SequenceNames.txt), which can be assumed to have this format:

SequenceA
SequenceB
...

I want to replace only the number between > and _ (avoiding incidental matches with the filename) with the name found at that line number in the SequenceNames file. For example, I would want

> 1_ENCFF899MTI.fastq.gz_seq1
ACTATC
> 2_ENCFF899MTI.fastq.gz_seq1

to become

> SequenceA_ENCFF899MTI.fastq.gz_seq1
ACTATC
> SequenceB_ENCFF899MTI.fastq.gz_seq1

I am able to make these replacements generally, but I'm really unsure of how to direct the index replacement specifically to the location/regex match between > and _ without performing a file-wide dictionary replacement of these numbers, and I'm struggling with awk array indexing to get something like

gawk '{print gensub(/^> ([0-9]*)_/,array[pattern],"\\1")}'

to produce what I'm looking for.

please update the question to show the expected result corresponding to the provided sample inputs — markp-fuso, May 05 '23 at 22:19
`1` / `2` => `A` / `B` is somewhat clear; what isn't clear is the first sample of `887` and `15`; if the file contains only 2 lines with the values `887` and `15` .... are these numbers supposed to be replaced with whatever is at lines #887 and #15 from the 2nd file ... or do we replace `887` (1st line in file) with whatever is in the 1st line of the 2nd file, and replace `15` (2nd line of file) with whatever is in the 2nd line of the 2nd file? — markp-fuso, May 05 '23 at 22:40
the current `gawk` code is incomplete, eg, how are you populating the variable `pattern` and the array `array[]`? — markp-fuso, May 05 '23 at 22:42
In all honesty, that's what I'm struggling with. The goal is your first understanding -- that numbers 887 and 15 would correspond to the 887th and 15th entries in the Names file. — jelfman, May 05 '23 at 23:16

score 2 · Accepted Answer · answered May 06 '23 at 00:32

Using gawk:

gawk 'NR==FNR { ar["> " NR "_"] = $0 }
NR>FNR  { match($0, /^> [0-9]+_/, m); gsub(/^> [0-9]+_/, "> " ar[m[0]] "_", $0); print }' SequenceNames.txt matches.fasta

The NR==FNR block reads the sequence name file into an array keyed by a string built from ">", a space, the line number, and a trailing "_", so each key has the same shape as the prefix of a FASTA header line.

The NR>FNR block uses match() to store, in array m, the text matched by a regex requiring ">" at the start of the line followed by a space, a number, and an underscore. gsub() then replaces that prefix with the corresponding name held in the sequence name array. Lines without a match (the sequences themselves) are printed unchanged.

Tested using GNU Awk 5.1.0
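
The three-argument form of match() used above is a GNU awk extension. Where only a POSIX awk is available, the same lookup can probably be done with the RSTART/RLENGTH variables that the two-argument match() sets. The following is a minimal sketch under that assumption, not a tested drop-in replacement; the function name rename_headers and the sample data are hypothetical.

```shell
#!/bin/sh
# rename_headers NAMES FASTA  (hypothetical helper, POSIX awk only)
# Prints FASTA with each "> N_" header prefix replaced by line N of NAMES.
rename_headers() {
    awk '
        # first file: line number -> sequence name
        NR==FNR { name[FNR] = $0; next }

        # header lines: match() sets RSTART and RLENGTH for the "> N_" prefix
        match($0, /^> [0-9]+_/) {
            idx = substr($0, 3, RLENGTH - 3)      # digits between "> " and "_"
            if (idx in name)                      # leave the line alone if N is out of range
                $0 = "> " name[idx] substr($0, RLENGTH)   # keep "_" and the rest
        }

        # print every line, changed or not
        { print }
    ' "$1" "$2"
}

# Demo on throwaway sample data (made up for illustration).
tmp=$(mktemp -d)
printf '%s\n' SequenceA SequenceB SequenceC > "$tmp/SequenceNames.txt"
printf '%s\n' '> 3_ENCFF899MTI.fastq.gz_seq1' ACTATC \
              '> 1_ENCFF899MTI.fastq.gz_seq2' GGGTTT > "$tmp/matches.fasta"
rename_headers "$tmp/SequenceNames.txt" "$tmp/matches.fasta"
rm -r "$tmp"
```

With the sample data above, the two headers should come out as "> SequenceC_..." and "> SequenceA_...", with the sequence lines untouched. Unlike the gawk version, a header whose number has no matching line in the names file is left as-is rather than rewritten with an empty name.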

Thanks, this seems to have worked perfectly. Also really appreciate the by-line runthrough. — jelfman, May 06 '23 at 03:51

score 2 · Answer 2 · answered May 06 '23 at 04:24

With GNU awk and the samples shown, try the following code.

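awk '
FNR==NR{
  seqVal[FNR]=$0      # line number -> sequence name
  next
}
# header line: capture the numeric index and look it up in seqVal
match($0,/^(> )([0-9]+)(_.*)$/,arr) && (arr[2] in seqVal){
  $0 = arr[1] seqVal[arr[2]] arr[3]
}
1                     # print every line, changed or not
' SequenceNames.txt input.fasta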

RavinderSingh13


score 1 · Answer 3 · jhnc · 2023-05-06T01:25:32.370

Assuming the goal is to replace the number in a FASTA header line that starts with > 887_ with the content of the 887th line of SequenceNames.txt:

awk '
    # create lookup table
    NR==FNR { i2p[FNR] = $0; next }

    # try to change relevant lines
    $1==">" {

        # extract index
        idx = $2
        sub( /_.+/, "", idx )

        # try to replace (no change if no match)
        if (idx in i2p)
            sub( /^[^_]+/, "> "i2p[idx] )
    }

    # print all lines
    1
' SequenceNames.txt input.fasta >output.fasta
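
Since the question mentions a folder full of FASTA files, the command above can be wrapped in a loop. This is a minimal sketch, assuming the files end in .fasta and that the renamed copies should go to a separate directory. The function name rename_folder and all paths are hypothetical.

```shell
#!/bin/sh
# rename_folder NAMES IN_DIR OUT_DIR  (hypothetical helper)
# Runs the awk lookup on every *.fasta file in IN_DIR and writes a
# renamed copy with the same file name into OUT_DIR.
rename_folder() {
    names=$1
    in_dir=$2
    out_dir=$3
    mkdir -p "$out_dir" || return 1
    for f in "$in_dir"/*.fasta; do
        [ -e "$f" ] || continue          # the glob matched nothing
        awk '
            # create lookup table
            NR==FNR { i2p[FNR] = $0; next }
            # replace the index on header lines (no change if no match)
            $1==">" {
                idx = $2
                sub( /_.+/, "", idx )
                if (idx in i2p)
                    sub( /^[^_]+/, "> " i2p[idx] )
            }
            # print all lines
            1
        ' "$names" "$f" > "$out_dir/$(basename "$f")"
    done
}

# Example call; the paths are placeholders:
# rename_folder SequenceNames.txt ./fasta ./renamed
```

Writing to a separate directory keeps the original files intact, which makes it easy to diff a file against its renamed copy before discarding anything.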

edited May 06 '23 at 01:25

answered May 05 '23 at 23:24


The updated version of this also seems to work. Really much appreciated, and useful to have two similar (but different) approaches to learn from. – jelfman May 06 '23 at 03:56
