Concatenate the sequence to the ID in fasta file

Question

Here is my input file:

>OTU1;size=4;
ATTCCGGGTTTACT
ATTCCTTTTATCGA
ATC
>OTU2;size=10;
CGGATCTAGGCGAT
ACT
>OTU3;size=5;
ATTCCCGGGATCTA
ACTTTTC

The expected output file is:

>OTU1;size=4;ATTCCGGGTTTACTATTCCTTTTATCGAATC
>OTU2;size=10;CGGATCTAGGCGATACT
>OTU3;size=5;ATTCCCGGGATCTAACTTTTC

I've tried the code from "Remove line breaks in a FASTA file", but it doesn't work for me, and I am not sure how to modify the code from that post. Any suggestions? Thanks in advance!

score 2 · Answer 1 · answered Jun 10 '19 at 22:41

$ awk '{printf "%s%s", (/^>/ ? ors : ""), $0; ors=ORS} END{print ""}' file
>OTU1;size=4;ATTCCGGGTTTACTATTCCTTTTATCGAATC
>OTU2;size=10;CGGATCTAGGCGATACT
>OTU3;size=5;ATTCCCGGGATCTAACTTTTC

answered Jun 10 '19 at 22:41

Ed Morton

Smart, concise and elegant solution. please explain the need for `ors` variable. Thanks – Dudi Boy Jun 10 '19 at 23:54
Thanks. You need to print null for the first line and ORS afterwards, so setting that variable after the first line is processed takes care of that. It's equivalent to writing `(/^>/ ? (NR>1 ? ORS : "") : ""), $0`, which I consider more cryptic. – Ed Morton Jun 11 '19 at 00:18
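
To make the role of `ors` concrete, here is a minimal sketch that runs the one-liner next to the explicit `NR` test mentioned in the comment. The temporary file and the shell variable names (`fasta`, `with_ors`, `with_nr`) are only for illustration and are not part of the original answer.

```shell
# Sample input taken from the question, written to a temporary file.
fasta=$(mktemp)
printf '%s\n' '>OTU1;size=4;' 'ATTCCGGGTTTACT' 'ATTCCTTTTATCGA' 'ATC' \
              '>OTU2;size=10;' 'CGGATCTAGGCGAT' 'ACT' > "$fasta"

# Variant 1: `ors` is unset (empty) while the first line is processed and
# holds ORS (a newline) from then on, so only later headers start a new line.
with_ors=$(awk '{ printf "%s%s", (/^>/ ? ors : ""), $0; ors = ORS }
                END { print "" }' "$fasta")

# Variant 2: the same decision written with an explicit NR test.
with_nr=$(awk '{ printf "%s%s", (/^>/ ? (NR > 1 ? ORS : "") : ""), $0 }
               END { print "" }' "$fasta")

printf '%s\n' "$with_ors"
rm -f "$fasta"
```

Both variants should produce the same two output lines. The `END{print ""}` supplies the final newline, since `printf` never emits one after the last sequence line.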

Dudi Boy · Accepted Answer · 2019-06-11T09:06:21.787

Here is another awk script, using awk's internal parsing mechanism.

awk 'BEGIN{RS=">";OFS="";}NR>1{$1=$1;print ">"$0}' input.txt

Output is:

>OTU1;size=4;ATTCCGGGTTTACTATTCCTTTTATCGAATC
>OTU2;size=10;CGGATCTAGGCGATACT
>OTU3;size=5;ATTCCCGGGATCTAACTTTTC

Explanation:

awk '
BEGIN {        # initialize awk internal variables
  RS=">";      # set `RS`=record separator to `>`
  OFS="";      # set `OFS`=output field separator to empty string.
}
NR>1 {         # handle from 2nd record (1st record is empty).
  $1=$1;       # regenerate the output line
  print ">"$0  # print out ">" with computed output line
}' input.txt

edited Jun 11 '19 at 09:06

answered Jun 10 '19 at 23:43

Dudi Boy

Very nice usage of the internal variables. – kvantour Jun 11 '19 at 09:05
You still need to set `FS="\n"`. Also, it might be wiser to use `FNR>1` instead of `NR>1`; imagine you want to process two files. – kvantour Jun 11 '19 at 10:05
Thanks for the comment. I checked, and `FS` has an implicit default that includes `\n`, and there is no way to override it (I hope I understand correctly). See https://www.math.utah.edu/docs/info/gawk_6.html#SEC45. – Dudi Boy Jun 11 '19 at 10:34
You are correct that `FS` is by default blanks or newlines, but you are incorrect that you cannot override the newline. When `FS` is two characters, it is assumed to be a regular expression, and then you override the newline as a field splitter. (See the [POSIX awk standard](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html), section on regular expressions.) – kvantour Jun 11 '19 at 11:04
Well, turns out I stumbled across this before, see https://lists.gnu.org/archive/html/bug-gawk/2019-04/msg00029.html, and the gawk maintainer is going to update the gawk documentation to record the difference and work to get the POSIX standard changed so it describes the way gawk (and some other awks) behaves which is to add `\n` to `FS` only when `FS` is a single char but if `FS` is a regexp then not add `\n` to it. All of this still only applies to the case where `RS` is null, of course. – Ed Morton Jun 11 '19 at 14:58
@kvantour the version of the gawk manual you referenced is out of date, see https://www.gnu.org/software/gawk/manual/gawk.html#Multiple-Line for a more recent version that addresses this behavior - `NOTE: When FS is the null string ("") or a regexp, this special feature of RS does not apply. It does apply to the default field separator of a single space: ‘FS = " "’. Note that language in the POSIX specification implies that this special feature should apply when FS is a regexp. However, Unix awk has never behaved that way, nor has gawk. This is essentially a bug in POSIX.` – Ed Morton Jun 11 '19 at 15:08
so wrt the script in this particular answer where RS is **not** null - you'd only need to set `FS` to `\n` if the lines contained chains of white space that you didn't want removed by `$1=$1` with `OFS=""`. Otherwise `\n` is already included in the field separator white space by the default FS of `" "` and so, like all other white space, will be stripped by `$1=$1` when `OFS=""` as it is here. – Ed Morton Jun 11 '19 at 15:22
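
The two points raised in these comments (whitespace handling under the default `FS`, and `NR` versus `FNR` with several files) can be checked with a small sketch. The input files `a.fa` and `b.fa` and their headers containing spaces are made up for illustration; the question's own data has no spaces in the IDs.

```shell
# Hypothetical input: the headers contain spaces, unlike the question's data.
dir=$(mktemp -d)
printf '%s\n' '>OTU1 sample A' 'ACGT' 'TT' > "$dir/a.fa"
printf '%s\n' '>OTU2 sample B' 'GG' 'CC' > "$dir/b.fa"

# Default FS: spaces and newlines are both separators, so $1=$1 with an
# empty OFS also removes the spaces inside the header.
default_fs=$(awk 'BEGIN { RS = ">"; OFS = "" }
                  FNR > 1 { $1 = $1; print ">" $0 }' "$dir/a.fa")

# FS="\n": only newlines separate fields, so the header keeps its spaces.
newline_fs=$(awk 'BEGIN { RS = ">"; FS = "\n"; OFS = "" }
                  FNR > 1 { $1 = $1; print ">" $0 }' "$dir/a.fa")

# Two files: NR keeps counting across files, so the empty first record of
# b.fa passes NR>1 and is printed as a bare ">". FNR restarts per file.
with_nr=$(awk 'BEGIN { RS = ">"; FS = "\n"; OFS = "" }
               NR > 1 { $1 = $1; print ">" $0 }' "$dir/a.fa" "$dir/b.fa")
with_fnr=$(awk 'BEGIN { RS = ">"; FS = "\n"; OFS = "" }
                FNR > 1 { $1 = $1; print ">" $0 }' "$dir/a.fa" "$dir/b.fa")

printf 'default FS: %s\nFS=\\n:     %s\n' "$default_fs" "$newline_fs"
rm -rf "$dir"
```

With this input, `default_fs` should come out as `>OTU1sampleAACGTTT` and `newline_fs` as `>OTU1 sample AACGTTT`. For the question's data, where the IDs contain no whitespace, both settings give the same result.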

RavinderSingh13 · Answer 3 · 2019-06-11T09:36:03.563

Could you please try the following too.

awk -v RS=">" 'NR>1{gsub(/\n/,"");print ">"$0}'  Input_file

My original attempt was `awk -v RS=">" -v FS="\n" -v OFS="" 'NF>1{$1=$1;print ">"$0}' Input_file`, but I later saw that Dudi Boy had already posted that approach, so I wrote the other one (shown first).
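
As a rough check that the two commands in this answer behave the same way, the sketch below feeds one record from the question through both. The shell variable names (`input`, `via_gsub`, `via_fields`) are illustrative only.

```shell
# One record in the question's format.
input=$(printf '%s\n' '>OTU3;size=5;' 'ATTCCCGGGATCTA' 'ACTTTTC')

# gsub variant: with RS=">", each record is the text between two ">"
# characters, so deleting its newlines joins the ID and the sequence lines.
via_gsub=$(printf '%s\n' "$input" |
  awk -v RS='>' 'NR>1 { gsub(/\n/, ""); print ">" $0 }')

# Field variant: split the record on newlines, then rebuild it with an
# empty OFS. NF>1 skips the empty record before the first ">".
via_fields=$(printf '%s\n' "$input" |
  awk -v RS='>' -v FS='\n' -v OFS='' 'NF>1 { $1 = $1; print ">" $0 }')

printf '%s\n%s\n' "$via_gsub" "$via_fields"
```

Both variables should hold `>OTU3;size=5;ATTCCCGGGATCTAACTTTTC`. The `gsub` form touches only newlines, so any spaces in a header would be left in place.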

edited Jun 11 '19 at 09:36

answered Jun 11 '19 at 01:50

RavinderSingh13

I think you have an error in here. Should it not be (NR>1)? – kvantour Jun 11 '19 at 09:07
Smart, concise and elegant solution. – Dudi Boy Jun 11 '19 at 09:12
@kvantour, thank you sir for letting know, edited it now. – RavinderSingh13 Jun 11 '19 at 09:57

score 0 · Answer 4 · answered Jun 11 '19 at 14:45
