3

I currently want to sort a huge FASTA file (more than 10**8 sequences) by sequence length. FASTA is a well-defined format in biology used to store sequences (nucleic or proteic):

>id1
sequence 1  # may span several lines
>id2
sequence 2
...

I have run a tool that gives me, in TSV format: the identifier, the length, and the byte position of the identifier.

For now, what I do is sort this file by the length column, then parse it and use seek to retrieve the corresponding sequence, then append it to a new file.

# this function retrieves the sequence using seek
def get_seq(file, offset):
    with open(file) as f_:
        f_.seek(offset, 0)  # jump to the first line of the sequence
        line = f_.readline().strip()
        to_return = ""  # accumulates the sequence
        # read until the next identifier (or end of file)
        while line and not line.startswith('>'):
            to_return += line
            line = f_.readline().strip()
    return to_return
# simply append the id and the sequence to a file
def write_seq(out_file, id_, sequence):
    with open(out_file, 'a') as out:
        out.write('>{}\n{}\n'.format(id_.strip(), sequence))

# main loop: parse the index file and call the functions defined above
with open(args.fai) as ref:
    for line in ref:
        spt = line.split()
        id_ = spt[0]
        seq = get_seq(args.i, int(spt[2]))
        write_seq(out_file=args.out, id_=id_, sequence=seq)

My problem is that this is really slow; is that normal (it takes several days)? Is there another way to do it? I am not a computer scientist by training, so I may be missing something, but I believed that indexing the file and using seek was the fastest way to achieve this. Am I wrong?
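The sorting of the index file is described but not shown in the question; a minimal sketch of that step, assuming a TSV with columns id, length, byte offset (the function name `sort_index` is hypothetical), could be:

```python
# Hypothetical sketch: sort a TSV index (id, length, byte offset) by length.
# Column layout is assumed from the question; adjust the index if yours differs.
# Note this loads the whole index into RAM; for 10**8 rows, an external
# sort such as GNU `sort -k2,2n` may be preferable.
def sort_index(index_path, sorted_path):
    with open(index_path) as fh:
        rows = [line.split() for line in fh if line.strip()]
    rows.sort(key=lambda r: int(r[1]))  # sort by the length column
    with open(sorted_path, "w") as out:
        for row in rows:
            out.write("\t".join(row) + "\n")
```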

RomainL.
  • I haven't tried with 10^8 sequences, but for an example file I have containing around 4.5M short sequences, the following Biopython based approach worked using a few Gbytes of RAM: `sortedrecs = sorted(SeqIO.parse("/tmp/test.fasta", format="fasta"), key = lambda rec : len(rec.seq))`. This is probably not the lightest way to represent fasta sequences in python, so there may be room for improvement. Another idea could be to modify the fastq-sort code so that it uses sequence length for sorting (http://homes.cs.washington.edu/~dcjones/fastq-tools/). – bli Dec 20 '16 at 18:08
  • I got something that seems to work based on fastq-tools, but it takes fastq formatted input: http://paste.ubuntu.com/23660236/ – bli Dec 20 '16 at 18:44
  • I made a pull request to add this to fastq-tools: https://github.com/dcjones/fastq-tools/pull/18 – bli Dec 20 '16 at 18:59
  • I added an answer based on this modified fastq-sort: http://stackoverflow.com/a/41278639/1878788 – bli Dec 22 '16 at 08:30
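For moderate file sizes, the in-memory approach from the first comment can also be sketched without Biopython (the function name `sort_fasta_in_memory` is hypothetical); like the Biopython one-liner, it only works when the whole file fits in RAM:

```python
# Plain-Python sketch of the in-memory idea from the comment above:
# load all records, sort by sequence length, rewrite.
def sort_fasta_in_memory(in_path, out_path):
    records = []  # list of (header_line, sequence) pairs
    with open(in_path) as fh:
        header, seq_parts = None, []
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(seq_parts)))
                header, seq_parts = line, []
            else:
                seq_parts.append(line)
        if header is not None:  # flush the last record
            records.append((header, "".join(seq_parts)))
    records.sort(key=lambda rec: len(rec[1]))
    with open(out_path, "w") as out:
        for header, seq in records:
            out.write(header + "\n" + seq + "\n")
```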

4 Answers

2

Seems like opening two files for each sequence is probably contributing a lot to the run time. You could pass file handles to your get/write functions rather than file names, but I would suggest using an established FASTA parser/indexer like Biopython or samtools. Here's an (untested) solution with samtools:

import subprocess

subprocess.call(["samtools", "faidx", args.i])
with open(args.fai) as ref, open(args.out, "w") as out:
    for line in ref:
        id_ = line.split()[0]
        # send faidx output to the handle instead of a shell ">>" redirection
        subprocess.call(["samtools", "faidx", args.i, id_], stdout=out)
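The answer's first suggestion, passing file handles instead of file names so each file is opened only once, can be sketched as follows (the function names are adapted from the question's code and `sort_by_index` is hypothetical; as in the question, the third index column is assumed to be the byte offset of the sequence start):

```python
# Open both files once and reuse the handles, instead of reopening
# the FASTA and the output file for every sequence.
def get_seq(fasta_handle, offset):
    fasta_handle.seek(offset, 0)  # jump to the first line of the sequence
    line = fasta_handle.readline().strip()
    seq = ""
    while line and not line.startswith(">"):  # until next record or EOF
        seq += line
        line = fasta_handle.readline().strip()
    return seq

def sort_by_index(fasta_path, fai_path, out_path):
    with open(fasta_path) as fasta, open(fai_path) as ref, \
            open(out_path, "w") as out:
        for line in ref:  # index assumed already sorted by length
            spt = line.split()
            out.write(">{}\n{}\n".format(spt[0], get_seq(fasta, int(spt[2]))))
```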
heathobrien
  • samtools faidx is what I use to index my file. – RomainL. Dec 20 '16 at 12:48
  • About the file opening you are clearly right, I gained a lot of speed this way, thanks. – RomainL. Dec 20 '16 at 12:50
  • In fact the documentation of samtools faidx http://www.htslib.org/doc/faidx.html states "In order to be indexed with samtools faidx, a FASTA file must be a text file of the form", so I guess faidx is a tool for indexing, not for getting sequences from an indexed fasta – RomainL. Dec 20 '16 at 13:00
  • Like a lot of bioinformatic software, the documentation is sketchy, but it says [here](http://www.htslib.org/doc/samtools.html) that faidx can "Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create .fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTA format." If you just give the sequence name without coordinates, it treats the entire sequence as the region – heathobrien Dec 20 '16 at 16:18
  • Then this is certainly more efficient than Python, thanks. – RomainL. Dec 20 '16 at 16:25
  • be sure to call it on the fasta file, not the index – heathobrien Dec 20 '16 at 16:29
2

What about bash and some basic Unix commands (csplit is the clue)? I wrote this simple script, but you can customize/improve it. It's not highly optimized and doesn't use the index file, but it may nevertheless run faster.

csplit -z -f tmp_fasta_file_ $1 '/>/' '{*}'

for file in tmp_fasta_file_*
do
  TMP_FASTA_WC=$(wc -l < $file | tr -d ' ')
  FASTA_WC+=$(echo "$file $TMP_FASTA_WC\n")
done

for filename in $(echo -e $FASTA_WC | sort -k2 -r -n | awk -F" " '{print $1}')
do
  cat "$filename" >> $2
done

rm tmp_fasta_file*

The first positional argument is the path to your FASTA file, the second one is the path for the output, i.e. ./script.sh input.fasta output.fasta

pjanek
  • This is interesting and obviously faster, but it has to create more than 10**8 subfiles, and each subfile has a minimum size of 4.0kb, so it is not a good solution; I should definitely look at C/C++ code rather than Python – RomainL. Dec 20 '16 at 15:25
  • Why 10**8 subfiles? This script creates a file for each sequence, not for each line... – pjanek Dec 20 '16 at 15:35
  • I have more than 10**8 sequences, my bad, I will edit the question to be more specific – RomainL. Dec 20 '16 at 15:41
  • OK, that changes everything ;) So the 4kb limit is painful, because you have a lot of short sequences? – pjanek Dec 20 '16 at 15:50
  • The 4kb is just the size of an empty file, am I wrong? But yes, basically the sequences are between 100 and 500 characters. – RomainL. Dec 20 '16 at 16:09
  • An empty file is 0kb, but every non-empty file takes at least 4kb (very rarely this value can differ) due to the filesystem block size. – pjanek Dec 20 '16 at 16:15
0

Using a modified version of fastq-sort (currently available at https://github.com/blaiseli/fastq-tools), we can convert the file to fastq format using bioawk, sort with the -L option I added, and convert back to fasta:

cat test.fasta \
    | tee >(wc -l > nb_lines_fasta.txt) \
    | bioawk -c fastx '{l = length($seq); printf "@"$name"\n"$seq"\n+\n%.*s\n", l, "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII"}' \
    | tee >(wc -l > nb_lines_fastq.txt) \
    | fastq-sort -L \
    | tee >(wc -l > nb_lines_fastq_sorted.txt) \
    | bioawk -c fastx '{print ">"$name"\n"$seq}' \
    | tee >(wc -l > nb_lines_fasta_sorted.txt) \
    > test_sorted.fasta

The fasta -> fastq conversion step is quite ugly: we need to generate dummy fastq qualities with the same length as the sequence. I found no better way to do it with (bio)awk than this hack based on the "dynamic width" feature mentioned at the end of https://www.gnu.org/software/gawk/manual/html_node/Format-Modifiers.html#Format-Modifiers.

The IIIII... string should be longer than the longest of the input sequences, otherwise, invalid fastq will be obtained, and when converting back to fasta, bioawk seems to silently skip such invalid reads.

In the above example, I added steps to count the lines. If the line numbers are not coherent, it may be because the IIIII... string was too short.

The resulting fasta file will have the shorter sequences first. To get the longest sequences at the top of the file, add the -r option to fastq-sort.

Note that fastq-sort writes intermediate files in /tmp. If for some reason it is interrupted before erasing them, you may want to clean your /tmp manually and not wait for the next reboot.

Edit

I actually found a better way to generate dummy qualities of the same length as the sequence: simply using the sequence itself:

cat test.fasta \
    | bioawk -c fastx '{print "@"$name"\n"$seq"\n+\n"$seq}' \
    | fastq-sort -L \
    | bioawk -c fastx '{print ">"$name"\n"$seq}' \
    > test_sorted.fasta

This solution is cleaner (and slightly faster), but I keep my original version above because the "dynamic width" feature of printf and the usage of tee to check intermediate data length may be interesting to know about.

bli
0

You can also do it very conveniently with awk, check the code below:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  input.fasta  |\
awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' |\
sort -k1,1n | cut -f 2- | tr "\t" "\n"

This and other methods have been posted on Biostars (e.g. using BBMap's sortbyname.sh script), and I strongly recommend that community for questions such as this one.

elcortegano
  • Can you explain how this method would be faster on a large fasta file? It seems to me that this method will be really slow because it does the sort after a pipe https://stackoverflow.com/questions/43362433/why-using-pipe-for-sort-linux-command-is-slow – RomainL. Mar 18 '21 at 16:09
  • I haven't done any benchmarking myself, but if you are concerned about performance, I'd first try `sortbyname.sh` from BBMap. It has the "inconvenience" that you have to install it, but apart from that it is a more established tool, and it allows sorting by other criteria. – elcortegano Mar 18 '21 at 17:40