Combining lines that occur between lines that contain a certain character

Question

I am trying to manipulate a FASTA file with the general format:

>ReadID other text
ACTGACTGACTGACTGACTGACTGACTGACTGACTG
>ReadID other text
ACTGACTGACTGACTGACTGACTGACTGACTGACTG
>ReadID other text
ACTGACTGACTGACTGACTGACTGACTGACTGACTG

I was attempting to take the read (ACTG...) and append it to the end of the row with the ReadID using

paste -sd "\t\n" input.file > output.file

This works just as it should, except that for whatever reason, some of the reads are intentionally split over two lines:

>ReadID other text
ACTGACTGACTGACTGACTGACTGACTGACTG
ACTG

This means I can't just simply replace line breaks with tab-delimiters.
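To make the failure concrete, here is a small sketch with made-up read IDs (`r1` and `r2` are placeholders, not taken from the real file). With `-s`, `paste` alternates the tab and newline delimiters line by line, so a single wrapped read shifts every record after it:

```shell
# Made-up sample: the first read is wrapped over two lines.
sample=$(printf '%s\n' '>r1 x' 'ACTG' 'AC' '>r2 y' 'GG')

# Same paste call as above, reading from stdin instead of input.file.
pasted=$(printf '%s\n' "$sample" | paste -sd '\t\n' -)

# Show each tab as "|" so the misalignment is visible.
printf '%s\n' "$pasted" | tr '\t' '|'
# >r1 x|ACTG
# AC|>r2 y
# GG
```

The second output row pairs the tail of the first read with the next header, which is why the delimiter trick only holds when every read occupies exactly one line.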

I guess the thing to do is to take all lines that fall between lines starting with > and combine them into a single line. How might I go about combining all lines that fall between > into a single line?

score 1 · Answer 1 · markp-fuso · 2021-08-13T17:59:58.653

UPDATE: per comment from OP, the number of lines between lines-starting-with-> can vary; updating my answer ...

Assumptions:

the first sequence line is appended to the header line with a single space ( )
all further sequence lines (2nd/3rd/.../nth) are appended without an intervening space

Sample input data:

$ cat fasta.dat
>ReadID other text
ACTGACTGACTGACTGACTGACTGACTGACTGACTG
>ReadID other text
ACTGACTGACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTGACTGACTG
>ReadID other text
ACTGACTGACTGACTGACTGACTGACTGACTGACTG

One awk idea:

$ awk '/^>/ {printf "%s%s ", pfx, $0; pfx="\n"; next} {printf "%s", $0} END {print ""}' fasta.dat
>ReadID other text ACTGACTGACTGACTGACTGACTGACTGACTGACTG
>ReadID other text ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG
>ReadID other text ACTGACTGACTGACTGACTGACTGACTGACTGACTG
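If a tab between the header and the read is preferred (as in the `paste` attempt from the question), a minimal variation of the same idea might look like the sketch below. The sample records and the variable names `sample` and `joined` are made up for illustration:

```shell
# Made-up sample; the first read is wrapped over three lines.
sample=$(printf '%s\n' '>ReadID1 other text' 'ACTGACTG' 'ACTG' 'AC' '>ReadID2 other text' 'GGCC')

joined=$(printf '%s\n' "$sample" | awk '
  /^>/ { printf "%s%s\t", pfx, $0; pfx = "\n"; next }  # header: start a new row, then a tab
       { printf "%s", $0 }                             # sequence line: append with no separator
  END  { if (NR) print "" }                            # finish the last row if there was input
')

printf '%s\n' "$joined"
```

The only changes from the command above are the `\t` in the first `printf` and the `NR` check in `END`, which avoids printing a lone newline for empty input.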

edited Aug 13 '21 at 17:59

answered Aug 13 '21 at 15:34

markp-fuso

The problem is that not every other line starts with `>`. Sometimes there are two lines between, sometimes there are 3 lines between, sometimes there may even be 10 lines between. If they were only ever two lines apart, my original code works. – Billy Mills Aug 13 '21 at 16:58
consider updating the question with an example that spans more than 2 lines **AND** the expected output (e.g., do you place a single space between the contents of lines 2,3,4,...n?) – markp-fuso Aug 13 '21 at 17:08
Made a new question: https://stackoverflow.com/questions/68776062/combining-lines-that-occur-between-lines-with-a-certain-character – Billy Mills Aug 13 '21 at 17:11
@BillyMills that new question (currently) doesn't require appending the `>` line with the following lines; for time being, until we've more sample data, I've updated this answer to address what I think you're looking for when there are more than 2 lines to be appended – markp-fuso Aug 13 '21 at 17:19
@BillyMills Did you try the solution below by choroba ? It works for me. – JRFerguson Aug 13 '21 at 17:23

score 1 · Answer 2 · answered Aug 13 '21 at 15:49

You can use the following Perl one-liner to put each read on a single line, directly below its header:

perl -ne 'sub out {return unless chomp @_; print shift, "\n", @_, "\n" } if (/^>/) {out(@buffer); @buffer = ()} push @buffer, $_; END {out(@buffer)}' -- input.fasta

Which corresponds to the following script:

# Subroutine which prints a header and concatenates the following lines.
sub out {
    return unless chomp @_;       # Remove newlines. Do nothing if there's no buffer.
    print shift, "\n", @_, "\n";  # Print the first line, newline, remaining lines, and newline.
}
if (/^>/) {        # If the line starts with a ">",
    out(@buffer);  # output the previous read
    @buffer = ();  # and empty the buffer.
}
push @buffer, $_;  # Store the current line to the buffer.
END {
    out(@buffer);  # Output the final read.
}
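For readers without Perl at hand, roughly the same layout (header on one line, the whole read on the next) can be sketched in awk. This is an illustrative equivalent rather than part of the original answer, and the sample records are made up:

```shell
# Made-up sample; the first read is wrapped over two lines.
sample=$(printf '%s\n' '>r1 x' 'ACTG' 'AC' '>r2 y' 'GG')

linear=$(printf '%s\n' "$sample" | awk '
  /^>/ { if (NR > 1) print ""; print; next }  # close the previous read, then print the header
       { printf "%s", $0 }                    # append sequence text without a newline
  END  { if (NR) print "" }                   # close the final read
')

printf '%s\n' "$linear"
# >r1 x
# ACTGAC
# >r2 y
# GG
```

Like the Perl version, it keeps the header and the read on separate lines, so the output is still valid FASTA.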

score 0 · Answer 3 · answered Aug 13 '21 at 15:42

Using sed:

$ sed ':a;N;$!ba;s/\n\([ACGT]\)/ \1/g' file

Output:

>ReadID other text ACTGACTGACTGACTGACTGACTGACTGACTGACTG
...

Explanation here.
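One caveat worth checking against your own data: the substitution turns every newline that precedes an A, C, G or T into a space, so a read wrapped over several lines comes out with spaces between its chunks. A small check with made-up records (the script is split into `-e` pieces here, which some non-GNU sed implementations require for the label):

```shell
# Made-up sample; the first read is wrapped over two lines.
sample=$(printf '%s\n' '>r1 x' 'ACTG' 'AC' '>r2 y' 'GG')

# Slurp the whole input into the pattern space, then replace newline+base with space+base.
spaced=$(printf '%s\n' "$sample" | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n\([ACGT]\)/ \1/g')

printf '%s\n' "$spaced"
# >r1 x ACTG AC
# >r2 y GG
```

Sequence lines that begin with any other character (for example `N` or a lowercase base) would not be joined at all, so the awk approach above may be the safer choice for wrapped reads.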

James Brown
