
Let's say I have a dynamic number of input strings (barcodes) read from a file. I want to split a huge 111 GB text file based on matches to those strings and write the hits to files.

I don't know how many inputs to expect.

I have done all the file input and string matching, but am stuck at the output step.

Ideally, I would open one output file for each entry in the input vector barcodes, which just contains strings. Is there a way to open a dynamic number of output files?

A suboptimal approach is to pass a single barcode string as an input argument and search for it, but this means reading the huge file once per barcode.

The barcode input vector just contains strings, e.g. "TAGAGTAT", "TAGAGTAG".

Ideally, the output should look like this if the previous two strings are the input:

file1 -> TAGAGTAT.txt
file2 -> TAGAGTAG.txt

Thanks for your help.

extern crate needletail;
use needletail::{parse_fastx_file, Sequence, FastxReader};
use std::str;
use std::fs::File;
use std::io::prelude::*;
use std::path::Path;

fn read_barcodes () -> Vec<String> {
    
    // TODO - can replace this with file reading code (OR move to an arguments based model, parse and demultiplex only one oligomer at a time..... )

    // The `vec!` macro can be used to initialize a vector of strings
    let barcodes = vec![
        "TCTCAAAG".to_string(),
        "AACTCCGC".into(),
        "TAAACGCG".into(),
    ];
    println!("Initial vector: {:?}", barcodes);
    barcodes
}

fn main() {
    //let filename = "test5m.fastq";

    let filename = "Undetermined_S0_R1.fastq";

    println!("Fastq filename: {} ", filename);
    //println!("Barcodes filename: {} ", barcodes_filename);

    let barcodes_vector: Vec<String> = read_barcodes();
    let mut counts_vector: [i32; 30] = [0; 30];

    let mut n_bases = 0;
    let mut n_valid_kmers = 0;
    let mut reader = parse_fastx_file(&filename).expect("Not a valid path/file");
    while let Some(record) = reader.next() {
        let seqrec = record.expect("invalid record");

        // get sequence
        let sequenceBytes = seqrec.normalize(false);
        
        let sequenceText = str::from_utf8(&sequenceBytes).unwrap();
        //println!("Seq: {} ", &sequenceText);

        // get first 8 chars (ASCII, so 1 byte per base; this slice panics if the read is shorter than 8)
        let sequenceOligo = &sequenceText[0..8];
        //println!("barcode vector {}, seqOligo {} ", &barcodes_vector[0], sequenceOligo);
        if sequenceOligo == barcodes_vector[0] {
            //println!("Hit ! Barcode vector {}, seqOligo {} ", &barcodes_vector[0], sequenceOligo);
            counts_vector[0] += 1;
        }
    }
}

colindaven
  • *"Are there any approaches to open a dynamic number of output files"* - `Vec`? It's not clear to me what you want your output to look like. Also, you say you've done the string matching part, but you also seem unsure how to divide the work (?). What exactly do you want help with? – kmdreko Jan 15 '21 at 18:12
  • Vec sounds useful. I just need an example of a) how to properly instantiate a vector of files named after the relevant string b) setup the output file objects properly c) write to those files (though the parser should provide a method) – colindaven Jan 15 '21 at 18:20

2 Answers


You probably want a HashMap<String, File>. You could build it from your barcode vector like this:

use std::collections::HashMap;
use std::fs::File;
use std::path::Path;

fn build_file_map(barcodes: &[String]) -> HashMap<String, File> {
    let mut files = HashMap::new();

    for barcode in barcodes {
        let filename = Path::new(barcode).with_extension("txt");
        let file = File::create(filename).expect("failed to create output file");
        files.insert(barcode.clone(), file);
    }

    files
}

You would call it like this:

let barcodes = vec!["TCTCAAAG".to_string(), "AACTCCGC".into(), "TAAACGCG".into()];
let file_map = build_file_map(&barcodes);

And you would get a file to write to like this:

// borrow the barcode; a String cannot be moved out of a Vec by indexing
let barcode = &barcodes[0];
let file = file_map.get(barcode).expect("barcode not in file map");
// write to file
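
Building on the map above, a single pass over the input could route each record to its file. The following is a minimal sketch under assumptions that are not in the question: plain `std` line reading stands in for the needletail parser, the barcode is taken to be the first 8 bytes of each line, and the names `demultiplex` and `out_dir` are hypothetical. Wrapping each `File` in a `BufWriter` is a suggestion to reduce the number of small writes, not something the code above requires.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, BufRead, BufWriter, Write};
use std::path::Path;

/// Reads `reader` once and appends each line to `<out_dir>/<barcode>.txt`
/// when its first 8 bytes match a known barcode (assumed rule).
/// Returns the number of hits per barcode.
fn demultiplex<R: BufRead>(
    reader: R,
    barcodes: &[String],
    out_dir: &Path,
) -> io::Result<HashMap<String, u64>> {
    // Open one buffered output file per barcode up front.
    let mut writers: HashMap<String, BufWriter<File>> = HashMap::new();
    for barcode in barcodes {
        let path = out_dir.join(format!("{}.txt", barcode));
        writers.insert(barcode.clone(), BufWriter::new(File::create(path)?));
    }

    let mut counts: HashMap<String, u64> = HashMap::new();
    for line in reader.lines() {
        let line = line?;
        // `get` returns None instead of panicking when the line is too short.
        let prefix = match line.get(0..8) {
            Some(p) => p,
            None => continue,
        };
        // Lines whose prefix is not a known barcode are skipped.
        if let Some(writer) = writers.get_mut(prefix) {
            writeln!(writer, "{}", line)?;
            *counts.entry(prefix.to_string()).or_insert(0) += 1;
        }
    }

    // Flush explicitly so write errors are reported rather than lost on drop.
    for writer in writers.values_mut() {
        writer.flush()?;
    }
    Ok(counts)
}
```

A call such as `demultiplex(BufReader::new(File::open("reads.txt")?), &barcodes, Path::new("."))` (file name hypothetical) would read the input only once, however many barcodes there are. With a very large number of barcodes, the operating system's limit on open files may become the constraint.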
kmdreko
  • Thanks, that's an excellent answer. I see what I was conceptually missing now. I will attempt to implement this. -> Done, the files were created as desired. Thanks ! – colindaven Jan 15 '21 at 18:48

I just need an example of a) how to properly instantiate a vector of files named after the relevant string b) setup the output file objects properly c) write to those files.

Here's a commented example:

use std::io::Write;
use std::fs::File;
use std::io;

fn read_barcodes() -> Vec<String> {
    // read barcodes here
    todo!()
}

fn process_barcode(barcode: &str) -> String {
    // process barcodes here
    todo!()
}

fn main() -> io::Result<()> {
    let barcodes = read_barcodes();
    
    for barcode in barcodes {
        // process barcode to get output
        let output = process_barcode(&barcode);
        
        // create file for barcode with {barcode}.txt name
        let mut file = File::create(format!("{}.txt", barcode))?;
        
        // write output to created file
        file.write_all(output.as_bytes())?;
    }
    
    Ok(())
}
pretzelhammer
  • Thanks, that's very clearly structured and useful. Is the Ok(()) at the end to return unit type / 0 / void as mentioned here https://stackoverflow.com/questions/24842271/what-is-the-purpose-of-the-unit-type-in-rust , or have I got that wrong ? – colindaven Jan 15 '21 at 19:13
  • @colindaven it's the same, I added the return type `io::Result<()>` to `main` so I could use the `?` operator on the `File::create` function call, but as a result I also have to end `main` with `Ok(())` if everything went okay. – pretzelhammer Jan 15 '21 at 19:55
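
To illustrate the point about `?` and `Ok(())` outside of `main`, here is a small sketch. The function name `write_report` is hypothetical; the same pattern applies to any function whose return type is `io::Result<()>`.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

/// Writes `text` to `path`, handing any I/O error back to the caller.
fn write_report(path: &Path, text: &str) -> io::Result<()> {
    // `?` returns early with the error if `File::create` fails.
    let mut file = File::create(path)?;
    // `write_all` also returns a Result; `?` keeps it from being ignored.
    file.write_all(text.as_bytes())?;
    // Reaching this line means every step succeeded, so wrap `()` in `Ok`.
    Ok(())
}
```

The `()` inside `Ok` is the unit type: the function has nothing useful to return on success, only the information that it did not fail.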