
Background

I'm currently trying to parse a .com zone file stored as a .txt file. Each line is structured like so:

```
blahblah.com    xx     xx      examplens.com
```

Currently, my code is structured like so:

```rust
extern crate regex;
use regex::Regex;
use rusqlite::{params, Connection, Result};


#[derive(Debug)]
struct Domain {
    id: i32,
    domain: String,
}

fn main() -> std::io::Result<()> {
    
    let mut reader = my_reader::BufReader::open("data/com_practice.txt")?;
    let mut buffer = String::new();
    let conn: Connection = Connection::open("/Users/alex/Projects/domain_randomizer/data/domains.db").unwrap();
    
    while let Some(line) = reader.read_line(&mut buffer) {
        let regexed_i_think = rexeginton_the_domain_only(line?.trim());
        println!("{:?}", regexed_i_think);
        sqliting(regexed_i_think, &conn).unwrap();
    }
    
    let mut stmt = conn.prepare("SELECT id, domain FROM domains").unwrap();

    let domain_iter = stmt.query_map(params![], |row| {
        Ok(Domain {
            id: row.get(0)?,
            domain: row.get(1)?,
        })
    }).unwrap();

    for domain in domain_iter {
        println!("Found person {:?}", domain.unwrap());
    }

    Ok(())
}

pub fn sqliting(domain_name: &str, conn: &Connection) -> Result<()> {

    let yeah = Domain {
        id: 0,
        domain: domain_name.to_string()
    };

    conn.execute(
        "INSERT INTO domains (domain) VALUES (?1)",
        params![yeah.domain],
    )?;

    Ok(())
}

mod my_reader {
    use std::{
        fs::File,
        io::{self, prelude::*},
    };

    pub struct BufReader {
        reader: io::BufReader<File>,
    }

    impl BufReader {
        pub fn open(path: impl AsRef<std::path::Path>) -> io::Result<Self> {
            let file = File::open(path)?;
            let reader = io::BufReader::new(file);

            Ok(Self { reader })
        }

        pub fn read_line<'buf>(
            &mut self,
            buffer: &'buf mut String,
        ) -> Option<io::Result<&'buf mut String>> {
            buffer.clear();

            self.reader
                .read_line(buffer)
                .map(|u| if u == 0 { None } else { Some(buffer) })
                .transpose()
        }
    }
}

pub fn rexeginton_the_domain_only(full_line: &str) -> &str {
    // NB: this compiles the regex on every call; see the comments below.
    // The dot before "com" is escaped so it only matches a literal '.'.
    let regex_str = Regex::new(r"(?m).*?\.com").unwrap();
    regex_str.captures(full_line).unwrap().get(0).unwrap().as_str()
}
```

Issue

So I am parsing a single domain at a time and making a separate insert for each one. As I've gathered, inserting would be far more efficient if I made thousands of inserts in a single transaction. However, I'm not quite sure how to refactor my parsing around this efficiently.
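
For illustration, here is a minimal sketch of what the batched approach could look like with rusqlite, assuming the same `domains` table as above (`insert_batch` and the batch size are hypothetical choices, not something rusqlite prescribes):

```rust
use rusqlite::{params, Connection, Result};

/// Hypothetical helper: insert one batch of domains inside a single
/// transaction, re-using one prepared statement for every row, so SQLite
/// syncs to disk once per batch instead of once per INSERT.
fn insert_batch(conn: &mut Connection, domains: &[String]) -> Result<()> {
    let tx = conn.transaction()?;
    {
        let mut stmt = tx.prepare("INSERT INTO domains (domain) VALUES (?1)")?;
        for domain in domains {
            stmt.execute(params![domain])?;
        }
    } // the statement borrows the transaction, so drop it before committing
    tx.commit()
}
```

The parsing loop would then push each parsed domain into a `Vec<String>`, call `insert_batch` every few thousand lines, and flush the remainder at end of file, instead of calling `sqliting` once per line.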

Question

How should I reorient my parsing process around my insertion process? And how can I measure the speed of the process in the first place, so that I can compare and contrast approaches in an articulate manner?
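
On the measurement side, the simplest gauge is wall-clock timing with `std::time::Instant` on a release build (`cargo run --release`). A minimal sketch, where `time_load` is a hypothetical wrapper rather than a standard API:

```rust
use std::time::Instant;

/// Hypothetical wrapper: run a loading routine that returns how many
/// rows it inserted, then print its throughput, so the per-row version
/// can be compared against the batched version on the same input file.
fn time_load(label: &str, load: impl FnOnce() -> u64) {
    let start = Instant::now();
    let rows = load();
    let elapsed = start.elapsed();
    println!(
        "{}: {} rows in {:.2?} ({:.0} rows/s)",
        label,
        rows,
        elapsed,
        rows as f64 / elapsed.as_secs_f64()
    );
}
```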

Alex Langsam
  • Use a prepared insert statement, and transactions. – Shawn Nov 15 '21 at 03:09
  • See [Improving INSERT-per-second performance of SQLite](https://stackoverflow.com/q/1711631/9952196) for tips (C, but probably adaptable to Rust). – Shawn Nov 15 '21 at 03:12
  • As @Shawn said, the main bottleneck is the insert into SQLite. Insert in batches using a prepared statement. The other issue is relatively minor: avoid constructing the regex for every parsed line (see the sketch after these comments). – Aivean Nov 15 '21 at 03:53
  • Note that the way your code is written, `conn.execute` waits until each line has been written to disk before moving on to the next line. Using a prepared statement, and especially a transaction as suggested by @Shawn, will allow the code to proceed while the data is being written in the background. – Jmb Nov 15 '21 at 09:50
  • Also, don't re-create the regex for every row. Avoiding the regex completely might also help, but I'm not sure about that. – Chayim Friedman Nov 16 '21 at 06:14
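
As the comments from Aivean and Chayim Friedman suggest, the regex can be compiled once, outside the per-line loop, and since the fields are whitespace-separated it may not be needed at all. A minimal sketch of both ideas, using the sample line from the question:

```rust
use regex::Regex;

fn main() {
    // Compile the pattern once, before the loop, instead of once per line.
    // The dot is escaped so ".com" only matches a literal dot.
    let re = Regex::new(r".*?\.com").unwrap();

    for line in ["blahblah.com    xx     xx      examplens.com"] {
        if let Some(m) = re.find(line.trim()) {
            println!("via regex: {}", m.as_str());
        }
        // Regex-free alternative: the domain is simply the first
        // whitespace-separated field.
        if let Some(first) = line.split_whitespace().next() {
            println!("via split: {}", first);
        }
    }
}
```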

1 Answer

  1. Open the database with journal_mode=WAL and, if you don't need durability, with synchronous=off (see the sketch after this list).
  2. Create a transaction.
  3. Create a prepared statement.
  4. Repeatedly execute the prepared statement.
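
A minimal sketch of steps 1 and 2 with rusqlite, issuing the pragmas as plain SQL so nothing version-specific is assumed (`open_for_bulk_load` is a hypothetical name); steps 3 and 4 would look like the `insert_batch` sketch in the question above:

```rust
use rusqlite::{params, Connection, Result};

/// Hypothetical helper: open a connection tuned for bulk loading.
fn open_for_bulk_load(path: &str) -> Result<Connection> {
    let conn = Connection::open(path)?;
    // `PRAGMA journal_mode = WAL` returns one row holding the resulting
    // mode, so read it back with query_row rather than execute.
    let mode: String =
        conn.query_row("PRAGMA journal_mode = WAL;", params![], |row| row.get(0))?;
    println!("journal_mode is now {}", mode);
    // synchronous = OFF trades durability for speed: a crash or power
    // loss can lose recently committed transactions. It returns no rows.
    conn.execute_batch("PRAGMA synchronous = OFF;")?;
    Ok(conn)
}
```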

As a sanity check, you could also profile your program to confirm the assumption that SQLite is in fact the bottleneck.

the8472