
I am trying to manipulate a FASTA file with the general format:

>ReadID other text
ACTGACTGACTGACTGACTGACTGACTGACTGACTG
>ReadID other text
ACTGACTGACTGACTGACTGACTGACTGACTGACTG
>ReadID other text
ACTGACTGACTGACTGACTGACTGACTGACTGACTG

I was attempting to take the read (ACTG...) and append it to the end of the row with the ReadID using

paste -sd "\t\n" input.file > output.file

This works just as it should, except that for whatever reason, some of the reads are intentionally split over two lines:

>ReadID other text
ACTGACTGACTGACTGACTGACTGACTGACTG
ACTG

This means I can't simply replace line breaks with tab delimiters.
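To make the failure concrete, here is a minimal sketch using made-up, shortened reads (the data is illustrative only). `paste -s` cycles through the delimiter list line by line, so one wrapped read shifts every following header into the wrong column:

```shell
# Illustrative input: the first read is wrapped over two lines.
result=$(printf '>ReadID other text\nACTGACTG\nACTG\n>ReadID other text\nACTG\n' |
    paste -sd '\t\n' -)

# Show the misaligned rows.
printf '%s\n' "$result"
```

The first row comes out as header, tab, `ACTGACTG`, but the second row starts with the leftover `ACTG` followed by a tab and the next header.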

I guess the thing to do is to take all lines that fall between lines starting with > and combine them into a single line. How might I go about combining all lines that fall between > into a single line?

Ed Morton

3 Answers


UPDATE: per comment from OP, the number of lines between lines-starting-with-> can vary; updating my answer ...

Assumptions:

  • the 1st and 2nd lines (the header and the first sequence line) are joined with a single space
  • the 2nd, 3rd, ..., nth lines are joined with no intervening space

Sample input data:

$ cat fasta.dat
>ReadID other text
ACTGACTGACTGACTGACTGACTGACTGACTGACTG
>ReadID other text
ACTGACTGACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTGACTGACTG
>ReadID other text
ACTGACTGACTGACTGACTGACTGACTGACTGACTG

One awk idea:

$ awk '/^>/ {printf "%s%s ", pfx, $0; pfx="\n"; next} {printf "%s", $0} END {print ""}' fasta.dat
>ReadID other text ACTGACTGACTGACTGACTGACTGACTGACTGACTG
>ReadID other text ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG
>ReadID other text ACTGACTGACTGACTGACTGACTGACTGACTGACTG
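If a tab is wanted between the header and the sequence, as in the `paste -sd "\t\n"` attempt from the question, the same idea might be adapted as below. This is only a sketch under that assumption; `join_fasta` is a hypothetical helper name.

```shell
# join_fasta is a hypothetical helper name used for illustration.
# It reads the named files, or standard input when no file is given.
join_fasta() {
    awk '
        # Header line: end the previous row (if any), print the header and a tab.
        /^>/ { printf "%s%s\t", pfx, $0; pfx = "\n"; next }
        # Sequence line: append it to the current row with no separator.
             { printf "%s", $0 }
        # Terminate the last row, but print nothing for empty input.
        END  { if (NR) print "" }
    ' "$@"
}
```

It could then be called as `join_fasta fasta.dat > output.file`.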
markp-fuso
  • The problem is that not every other line starts with `>`. Sometimes there are two lines between, sometimes there are 3 lines between, sometimes there may even be 10 lines between. If they were only ever two lines apart, my original code works. – Billy Mills Aug 13 '21 at 16:58
  • consider updating the question with an example that spans more than 2 lines **AND** the expected output (eg, do you place a single space between the contents from lines 2,3,4,...n?) – markp-fuso Aug 13 '21 at 17:08
  • Made a new question: https://stackoverflow.com/questions/68776062/combining-lines-that-occur-between-lines-with-a-certain-character – Billy Mills Aug 13 '21 at 17:11
  • @BillyMills that new question (currently) doesn't require appending the `>` line with the following lines; for time being, until we've more sample data, I've updated this answer to address what I think you're looking for when there are more than 2 lines to be appended – markp-fuso Aug 13 '21 at 17:19
  • @BillyMills Did you try the solution below by choroba ? It works for me. – JRFerguson Aug 13 '21 at 17:23

You can use the following Perl one-liner to put each read's sequence on a single line:

perl -ne 'sub out {return unless chomp @_; print shift, "\n", @_, "\n" } if (/^>/) {out(@buffer); @buffer = ()} push @buffer, $_; END {out(@buffer)}' -- input.fasta

Which corresponds to the following script:

# Subroutine which prints a header and concatenates the following lines.
sub out {
    return unless chomp @_;       # Remove newlines. Do nothing if there's no buffer.
    print shift, "\n", @_, "\n";  # Print the first line, newline, remaining lines, and newline.
}
if (/^>/) {        # If the line starts with a ">",
    out(@buffer);  # output the previous read
    @buffer = ();  # and empty the buffer.
}
push @buffer, $_;  # Store the current line to the buffer.
END {
    out(@buffer);  # Output the final read.
}
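
For comparison, roughly the same layout (header on one line, the full sequence on the next) can be sketched in awk. This is an illustration and not part of the Perl script above; `unwrap_fasta` is a hypothetical name.

```shell
# unwrap_fasta is a hypothetical helper name used for illustration.
# It reads the named files, or standard input when no file is given.
unwrap_fasta() {
    awk '
        # Header line: flush the sequence collected so far, then print the header.
        /^>/ { if (seq != "") print seq; seq = ""; print; next }
        # Sequence line: append it to the buffer for the current read.
             { seq = seq $0 }
        # Flush the sequence of the final read.
        END  { if (seq != "") print seq }
    ' "$@"
}
```

It could be run as `unwrap_fasta input.fasta`. Unlike the Perl version, it prints nothing after a header that has no sequence lines.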
choroba

Using sed:

$ sed ':a;N;$!ba;s/\n\([ACGT]\)/ \1/g' file

Output:

>ReadID other text ACTGACTGACTGACTGACTGACTGACTGACTGACTG
...

Explanation here.

James Brown