2

I am trying to replace all characters that are not C, T, A or G with an N in the sequence part of a fasta file - i.e. every 2nd line

I think some combination of awk and tr is what I would need...

To print every other line:

awk '{if (NR % 2 == 0) print $0}' myfile

To replace these characters with an N

tr YRHIQ- N

...but I don't know how to combine them so that the character replacement is only on every 2nd line but it prints every line

this is the sort of thing I have

>SEQUENCE_1
AGCYGTQA-TGCTG
>SEQUENCE_2
AGGYGTQA-TGCTC

and I want it to look like this:

>SEQUENCE_1
AGCNGTNANTGCTG
>SEQUENCE_2
AGGNGTNANTGCTC

but not like this:

>SENUENCE_1
AGCNGTNANTGCTG
>SENUENCE_2
AGGNGTNANTGCTC
mcfiston
  • 23
  • 5

3 Answers3

3

The question you have is easy to answer but will not help you when you handle generic fasta files. Fasta files have a sequence header followed by one or multiple lines which can be concatenated to represent the sequence. The Fasta file-format roughly obeys the following rules:

  • The description line (defline) or header/identifier line, which begins with <greater-then> character (>), gives a name and/or a unique identifier for the sequence, and may also contain additional information.
  • Following the description line is the actual sequence itself in a standard one-letter character string. Anything other than a valid character would be ignored (including spaces, tabulators, asterisks, etc...).
  • The sequence can span multiple lines.
  • A multiple sequence FASTA format would be obtained by concatenating several single sequence FASTA files in a common file, generally by leaving an empty line in between two subsequent sequences.

To answer the OP's question, If you just want to process every second line, you want to do:

awk '!(NR%2){gsub(/[^CTAG]/, "N")}1' file.fasta

This method will, however, fail on any of the following cases:

  • fasta file with a multi-line sequence
  • multi-fasta file with a possible blank-line between subsequent sequences

A better way would be to exclude the header line and process all other lines:

awk '!/^>/{gsub(/[^CTAG]/, "N")}1' file.fasta
kvantour
  • 25,269
  • 4
  • 47
  • 72
  • great info about fasta file. Thanks for sharing. I have no idea about what `fasta` file is. If you don't mind posting a sample of it and how this get generated? – JBone Mar 25 '19 at 12:38
  • Fasta files are a common input/output format for DNA sequence data. When you do DNA sequencing this is the format that you will often receive your data. They look like the examples given in the original question, although like @kvantour says - they may be multi-line, i.e. 1 sequence can be represented by 1 header line followed by multiple sequence lines. Increasingly common now are fastq files, which are the same, but are restricted to 4 lines per seqeunce: Line 1 - header line (starts with an "@"); line 2 - sequence line; line 3 - "+"; line 4 - quality information of each of the DNA letters – mcfiston Mar 25 '19 at 16:09
2

Thanks to @kvantour's explanation on fasta files, here is another sed solution which suits your task better than the old one:

sed '/^>/! s/[^ACTG]/N/g' file.fasta
  • /^>/!: do the following if this line doesn't begin with a >,
  • s/[^ACTG]/N/g: replace every character but ACTG with N.
oguz ismail
  • 1
  • 16
  • 47
  • 69
  • I get the following error - do you know why?: sed: 1: "0~2 s/[^ACGT]/N/g": invalid command code ~ – mcfiston Mar 25 '19 at 07:13
  • any idea how to check that? It seems that it is not straightforward on OS X ??: https://stackoverflow.com/q/37639496/5878624 – mcfiston Mar 25 '19 at 07:21
1

Here is one solution with awk

awk 'NR%2 ==0{gsub(/[^CTAG]/, "N")}1' file

the result

SEQUENCE_1
AGCNGTNANTGCTG
SEQUENCE_2
AGGNGTNANTGCTC

Explanation As OP wanted, I am only looking for every even row to apply the change by
NR/2 == 0

NR is number of records (rows here) read so far from the file

and gsub(/[^CTAG]/, "N") replace with all the characters that are NOT 'C', 'T', 'A', 'G'

[^CTAG] the ^ is the negation

and awk goes by expression action format

here the expression is NR/2==0 and the action is replacing the characters with N with gsub that are not CTAG

JBone
  • 1,724
  • 3
  • 20
  • 32