
I'm working on FASTA files. FASTA files are used in biology to store sequences.

>sequence1 identifier (a string)

sequence on one or several lines (a string)

...

>last sequence identifier (a string)

the sequence on one or several lines (a string)

I want to make an iterator that returns a struct while reading the file:

struct fasta_seq {
    identifier: String,
    sequence: String,
}

In Python, it's easy. This code returns a tuple, but the idea is the same:

def get_seq_one_by_one(file_):
    """Generator yielding (prompt, sequence) for each sequence"""

    sequence = ''
    prompt = ''

    for line in file_:

        if line.startswith('>'):

            # A new header starts: emit the record collected so far
            if sequence:
                yield (prompt, sequence)

            sequence = ''
            prompt = line.strip()[1:]

        else:
            sequence += line.strip()

    # Emit the last record
    yield (prompt, sequence)

This is convenient and allows me to make clearer code because I can iterate through the file with a simple for loop.

for identifier, sequence in get_seq_one_by_one(open_file):
    ...  # do something with each record

I found similar topics:

If I understand correctly, they know the size of the buffer to read. In my case I don't know it because the identifier and/or sequence length may change.

I have checked, and using Rust's yield does not seem to be a great idea because it is described as unstable.

I do not want you to write the code for me; I am trying to learn by rewriting in Rust a script I wrote in Python. I just don't know which tools to use to solve this problem.

If you could point out the overall idea of how to achieve this, it would be really nice. If it can be done without unsafe code, even better.

Shepmaster
RomainL.
  • There are some crates for parsing FASTA files: [bio](https://crates.io/crates/bio) seems to be the most used one. Is that enough for you? – Grégory OBANOS Mar 13 '18 at 12:40
  • They have implemented what I am looking for; now I have to understand it. Thanks. – RomainL. Mar 13 '18 at 12:46
  • @RomainL. One strength of Rust is how easy it is to use external crates. Do not forget to check whether something already exists before reinventing the wheel. – Boiethios Mar 13 '18 at 12:56

2 Answers

1

I managed to get something working. It's clearly a beginner-level implementation but it works.

use std::io::{self, BufRead};
use std::str::from_utf8;

struct FastaSeq {
    identifier: String,
    sequence: String,
}

// adapted from: https://docs.rs/bio/0.17.0/src/bio/io/fasta.rs.html#7-1013
struct FastaReader<R: io::Read> {
    reader: io::BufReader<R>,
}

// adapted from: https://docs.rs/bio/0.17.0/src/bio/io/fasta.rs.html#7-1013
impl<R: io::Read> FastaReader<R> {
    /// Create a new Fasta reader given an instance of `io::Read`.
    pub fn new(reader: R) -> Self {
        FastaReader {
            reader: io::BufReader::new(reader),
        }
    }
}

impl<R: io::Read> Iterator for FastaReader<R> {
    type Item = Result<FastaSeq, io::Error>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut buf = vec![];
        // Read up to and including the next '>' (or up to the end of the file).
        match self.reader.read_until(b'>', &mut buf) {
            Ok(0) => None,

            Ok(_) => {
                // At the very start of the file the first read returns only ">",
                // so read again to get the first record.
                if buf == b">" {
                    if let Err(e) = self.reader.read_until(b'>', &mut buf) {
                        return Some(Err(e));
                    }
                }
                // Note: unwrap() panics if the file is not valid UTF-8.
                let record = from_utf8(&buf).unwrap().trim_matches('>');

                // The first line is the identifier; the remaining lines form the sequence.
                let mut lines = record.lines();
                let identifier = lines.next().unwrap_or("").to_string();
                let sequence: String = lines.collect();
                Some(Ok(FastaSeq { identifier, sequence }))
            }

            Err(e) => Some(Err(e)),
        }
    }
}
RomainL.
0

As said in a comment, the best option is to use an existing crate. If you absolutely want to write your own code, you could try something like:

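use std::fs::File;
use std::io::Read;

struct FastaSeq {
    identifier: String,
    sequence: String,
}

fn process_file(file_name: &str) -> Result<Vec<FastaSeq>, std::io::Error> {
    let mut contents = String::new();
    File::open(file_name)?.read_to_string(&mut contents)?;
    let mut ret = Vec::new();
    // `peekable` lets us look at the next line without consuming it.
    // (`take_while` would consume and drop the next header line.)
    let mut lines = contents.lines().peekable();

    // I assume that the first line begins with '>'
    while let Some(line) = lines.next() {
        let mut sequence = String::new();
        // Collect sequence lines until the next header or the end of the file.
        while let Some(next) = lines.peek() {
            if next.starts_with('>') {
                break;
            }
            sequence.push_str(next.trim());
            lines.next();
        }
        ret.push(FastaSeq {
            identifier: line.trim_start_matches('>').trim().to_string(),
            sequence,
        });
    }
    Ok(ret)
}

fn main() {
    match process_file("your file") {
        Ok(seqs) => println!("Read {} sequences", seqs.len()),
        Err(e) => {
            eprintln!("An error occurred: {}", e);
            std::process::exit(1);
        }
    }
}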
Boiethios
  • Usually I use what exists, but in this case I am trying to learn. I have a question about your answer: it seems to me that you read the whole file and push all the structs into a Vec. Is it possible to make an iterator over the BufReader? When dealing with very large files, you want to avoid putting everything in memory. – RomainL. Mar 13 '18 at 13:00
  • @RomainL. Yes, you can. You can read a file line by line: see [this question](https://stackoverflow.com/questions/31192956/whats-the-de-facto-way-of-reading-and-writing-files-in-rust-1-x). Buffered reading is useful if you have **very large files**, on the order of gigabytes. – Boiethios Mar 13 '18 at 15:30
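
To illustrate the line-by-line approach mentioned in the last comment, here is a minimal sketch of a streaming iterator built on `BufRead::lines`. It is not taken from either answer: the names `FastaLines` and `pending_header` are hypothetical, and it assumes that anything before the first '>' line can be skipped and that the input is valid UTF-8. The locals `prompt` and `sequence` of the Python generator become state kept between calls to `next`, so only one record is held in memory at a time.

```rust
use std::io::{self, BufRead};

#[derive(Debug, PartialEq)]
struct FastaSeq {
    identifier: String,
    sequence: String,
}

/// Hypothetical streaming reader that yields one record per call to `next`.
struct FastaLines<B: BufRead> {
    lines: io::Lines<B>,
    // Header of the following record, if it was already read while
    // collecting the previous record's sequence.
    pending_header: Option<String>,
}

impl<B: BufRead> FastaLines<B> {
    fn new(reader: B) -> Self {
        FastaLines { lines: reader.lines(), pending_header: None }
    }
}

impl<B: BufRead> Iterator for FastaLines<B> {
    type Item = io::Result<FastaSeq>;

    fn next(&mut self) -> Option<Self::Item> {
        // Use the header saved by the previous call, or scan for the next one.
        let identifier = match self.pending_header.take() {
            Some(header) => header,
            None => loop {
                // `?` ends the iteration when the input is exhausted.
                match self.lines.next()? {
                    Ok(line) => {
                        if let Some(rest) = line.strip_prefix('>') {
                            break rest.trim().to_string();
                        }
                        // Assumption: lines before the first header are skipped.
                    }
                    Err(e) => return Some(Err(e)),
                }
            },
        };

        // Accumulate sequence lines until the next header or the end of input.
        let mut sequence = String::new();
        while let Some(line) = self.lines.next() {
            match line {
                Ok(line) => {
                    if let Some(rest) = line.strip_prefix('>') {
                        self.pending_header = Some(rest.trim().to_string());
                        break;
                    }
                    sequence.push_str(line.trim());
                }
                Err(e) => return Some(Err(e)),
            }
        }
        Some(Ok(FastaSeq { identifier, sequence }))
    }
}

fn main() -> io::Result<()> {
    // In-memory input for illustration; a `BufReader<File>` works the same way.
    let data = ">seq1 first\nACGT\nTTGA\n>seq2\nGGCC\n";
    for record in FastaLines::new(data.as_bytes()) {
        let record = record?;
        println!("{}: {}", record.identifier, record.sequence);
    }
    Ok(())
}
```

With a real file, `FastaLines::new(io::BufReader::new(File::open(path)?))` should work in the same way, and the `for` loop then mirrors the Python `for identifier, sequence in ...` usage.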