let's say I have a dynamic number of input strings from a file (barcodes). I want to split up a huge 111GB text file based upon matches to the input strings, and write those hits to files.
I don't know how many inputs to expect.
I have done all the file input and string matching, but am stuck at the output step.
Ideally, I would open a file for each input in the input vector barcodes, just containing strings. Are there any approaches to open a dynamic number of output files?
A suboptimal approach is searching for a barcode string as an input arg, but this means I have to read the huge file repeatedly.
The barcode input vector just contains strings, eg "TAGAGTAT", "TAGAGTAG",
Ideally, output should look like this if the previous two strings are input
file1 -> TAGAGTAT.txt
file2 -> TAGAGTAG.txt
Thanks for your help.
extern crate needletail;
use needletail::{parse_fastx_file, Sequence, FastxReader};
use std::str;
use std::fs::File;
use std::io::prelude::*;
use std::path::Path;
fn read_barcodes () -> Vec<String> {
// TODO - can replace this with file reading code (OR move to an arguments based model, parse and demultiplex only one oligomer at a time..... )
// The `vec!` macro can be used to initialize a vector or strings
let barcodes = vec![
"TCTCAAAG".to_string(),
"AACTCCGC".into(),
"TAAACGCG".into()
];
println!("Initial vector: {:?}", barcodes);
return barcodes
}
fn main() {
//let filename = "test5m.fastq";
let filename = "Undetermined_S0_R1.fastq";
println!("Fastq filename: {} ", filename);
//println!("Barcodes filename: {} ", barcodes_filename);
let barcodes_vector: Vec<String> = read_barcodes();
let mut counts_vector: [i32; 30] = [0; 30];
let mut n_bases = 0;
let mut n_valid_kmers = 0;
let mut reader = parse_fastx_file(&filename).expect("Not a valid path/file");
while let Some(record) = reader.next() {
let seqrec = record.expect("invalid record");
// get sequence
let sequenceBytes = seqrec.normalize(false);
let sequenceText = str::from_utf8(&sequenceBytes).unwrap();
//println!("Seq: {} ", &sequenceText);
// get first 8 chars (8chars x 2 bytes)
let sequenceOligo = &sequenceText[0..8];
//println!("barcode vector {}, seqOligo {} ", &barcodes_vector[0], sequenceOligo);
if sequenceOligo == barcodes_vector[0]{
//println!("Hit ! Barcode vector {}, seqOligo {} ", &barcodes_vector[0], sequenceOligo);
counts_vector[0] = counts_vector[0] + 1;
}