
Background

I'm currently trying to parse a .com zone file stored as a .txt file. Each line is structured like so:

```
blahblah.com    xx     xx      examplens.com
```

Currently, my code is structured like so:

```rust
extern crate regex;
use regex::Regex;
use rusqlite::{params, Connection, Result};


#[derive(Debug)]
struct Domain {
    id: i32,
    domain: String,
}

fn main() -> std::io::Result<()> {
    
    let mut reader = my_reader::BufReader::open("data/com_practice.txt")?;
    let mut buffer = String::new();
    let conn: Connection = Connection::open("/Users/alex/Projects/domain_randomizer/data/domains.db").unwrap();
    
    while let Some(line) = reader.read_line(&mut buffer) {
        let regexed_i_think = rexeginton_the_domain_only(line?.trim());
        println!("{:?}", regexed_i_think);
        sqliting(regexed_i_think, &conn).unwrap();
    }
    
    let mut stmt = conn.prepare("SELECT id, domain FROM domains").unwrap();

    let domain_iter = stmt.query_map(params![], |row| {
        Ok(Domain {
            id: row.get(0)?,
            domain: row.get(1)?,
        })
    }).unwrap();

    for domain in domain_iter {
        println!("Found person {:?}", domain.unwrap());
    }

    Ok(())
}

pub fn sqliting(domain_name: &str, conn: &Connection) -> Result<()> {

    let yeah = Domain {
        id: 0,
        domain: domain_name.to_string()
    };

    conn.execute(
        "INSERT INTO domains (domain) VALUES (?1)",
        params![yeah.domain],
    )?;

    Ok(())
}

mod my_reader {
    use std::{
        fs::File,
        io::{self, prelude::*},
    };

    pub struct BufReader {
        reader: io::BufReader<File>,
    }

    impl BufReader {
        pub fn open(path: impl AsRef<std::path::Path>) -> io::Result<Self> {
            let file = File::open(path)?;
            let reader = io::BufReader::new(file);

            Ok(Self { reader })
        }

        pub fn read_line<'buf>(
            &mut self,
            buffer: &'buf mut String,
        ) -> Option<io::Result<&'buf mut String>> {
            buffer.clear();

            self.reader
                .read_line(buffer)
                .map(|u| if u == 0 { None } else { Some(buffer) })
                .transpose()
        }
    }
}

pub fn rexeginton_the_domain_only(full_line: &str) -> &str {
    // NB: this compiles the regex on every call; see the comments below.
    // The dot before "com" is escaped so it only matches a literal '.'.
    let regex_str = Regex::new(r"(?m).*?\.com").unwrap();
    regex_str.captures(full_line).unwrap().get(0).unwrap().as_str()
}
```

Issue

So I am parsing a single domain at a time and making a separate insert for each one. As I've gathered, inserting would be far more efficient if I made thousands of inserts in a single transaction. However, I'm not quite sure how to refactor my parsing around this efficiently.
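
For illustration, here is a minimal sketch of what the batched approach could look like with rusqlite, assuming the same `domains` table as above (`insert_batch` and the batch size are hypothetical choices, not something rusqlite prescribes):

```rust
use rusqlite::{params, Connection, Result};

/// Hypothetical helper: insert one batch of domains inside a single
/// transaction, re-using one prepared statement for every row, so SQLite
/// syncs to disk once per batch instead of once per INSERT.
fn insert_batch(conn: &mut Connection, domains: &[String]) -> Result<()> {
    let tx = conn.transaction()?;
    {
        let mut stmt = tx.prepare("INSERT INTO domains (domain) VALUES (?1)")?;
        for domain in domains {
            stmt.execute(params![domain])?;
        }
    } // the statement borrows the transaction, so drop it before committing
    tx.commit()
}
```

The parsing loop would then push each parsed domain into a `Vec<String>`, call `insert_batch` every few thousand lines, and flush the remainder at end of file, instead of calling `sqliting` once per line.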

Question

How should I reorient my parsing process around my insertion process? And how can I measure the speed of the process in the first place, so that I can compare and contrast approaches in an articulate manner?
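
On the measurement side, the simplest gauge is wall-clock timing with `std::time::Instant` on a release build (`cargo run --release`). A minimal sketch, where `time_load` is a hypothetical wrapper rather than a standard API:

```rust
use std::time::Instant;

/// Hypothetical wrapper: run a loading routine that returns how many
/// rows it inserted, then print its throughput, so the per-row version
/// can be compared against the batched version on the same input file.
fn time_load(label: &str, load: impl FnOnce() -> u64) {
    let start = Instant::now();
    let rows = load();
    let elapsed = start.elapsed();
    println!(
        "{}: {} rows in {:.2?} ({:.0} rows/s)",
        label,
        rows,
        elapsed,
        rows as f64 / elapsed.as_secs_f64()
    );
}
```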

Alex Langsam
  • Use a prepared insert statement, and transactions. – Shawn Nov 15 '21 at 03:09
  • See [Improving INSERT-per-second performance of SQLite](https://stackoverflow.com/q/1711631/9952196) for tips (C, but probably adaptable to Rust). – Shawn Nov 15 '21 at 03:12
  • As @Shawn said, the main bottleneck is the insert into SQLite. Insert in batches using a prepared statement. The other issue is relatively minor: avoid constructing the regex for every parsed line (see the sketch after these comments). – Aivean Nov 15 '21 at 03:53
  • Note that the way your code is written, `conn.execute` waits until each line has been written to disk before moving on to the next line. Using a prepared statement, and especially a transaction as suggested by @Shawn, will allow the code to proceed while the data is being written in the background. – Jmb Nov 15 '21 at 09:50
  • Also, don't re-create the regex for every row. Avoiding the regex completely might also help, but I'm not sure about that. – Chayim Friedman Nov 16 '21 at 06:14
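
As the comments from Aivean and Chayim Friedman suggest, the regex can be compiled once, outside the per-line loop, and since the fields are whitespace-separated it may not be needed at all. A minimal sketch of both ideas, using the sample line from the question:

```rust
use regex::Regex;

fn main() {
    // Compile the pattern once, before the loop, instead of once per line.
    // The dot is escaped so ".com" only matches a literal dot.
    let re = Regex::new(r".*?\.com").unwrap();

    for line in ["blahblah.com    xx     xx      examplens.com"] {
        if let Some(m) = re.find(line.trim()) {
            println!("via regex: {}", m.as_str());
        }
        // Regex-free alternative: the domain is simply the first
        // whitespace-separated field.
        if let Some(first) = line.split_whitespace().next() {
            println!("via split: {}", first);
        }
    }
}
```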

1 Answer

  1. Open the database with journal_mode=WAL and, if you don't need durability, with synchronous=off (see the sketch after this list).
  2. Create a transaction.
  3. Create a prepared statement.
  4. Repeatedly execute the prepared statement.
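
A minimal sketch of steps 1 and 2 with rusqlite, issuing the pragmas as plain SQL so nothing version-specific is assumed (`open_for_bulk_load` is a hypothetical name); steps 3 and 4 would look like the `insert_batch` sketch in the question above:

```rust
use rusqlite::{params, Connection, Result};

/// Hypothetical helper: open a connection tuned for bulk loading.
fn open_for_bulk_load(path: &str) -> Result<Connection> {
    let conn = Connection::open(path)?;
    // `PRAGMA journal_mode = WAL` returns one row holding the resulting
    // mode, so read it back with query_row rather than execute.
    let mode: String =
        conn.query_row("PRAGMA journal_mode = WAL;", params![], |row| row.get(0))?;
    println!("journal_mode is now {}", mode);
    // synchronous = OFF trades durability for speed: a crash or power
    // loss can lose recently committed transactions. It returns no rows.
    conn.execute_batch("PRAGMA synchronous = OFF;")?;
    Ok(conn)
}
```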

As a sanity check, you could also profile your program to confirm the assumption that SQLite is in fact the bottleneck.

the8472