
I have a scenario where I have to iterate through multiple tables in a fairly big SQLite database. The tables store information about planet positions in the sky over the years, so e.g. for Mars I have tables Mars_2000, Mars_2001 and so on. The table structure is always the same:

|id:INTEGER|date:TEXT|longitude:REAL|

The thing is that for a certain task I need to iterate through these tables, which costs a lot of time (for more than 10 queries it's painful).

I suppose that if I merge all the per-year tables into one big table, performance might be better, as one query against one big table should beat 50 queries against smaller ones. I wanted to make sure this would work before committing to it, as the database is humongous (around 20 GB) and reshaping it would take a while.

Is the plan I just described viable? Is there any other solution for such a case?

It might be helpful, so I attach the function that produces my SQL query, which is unique for each table:

pub fn transition_query(
    select_param: &str, // usually an asterisk
    table_name: &str, // table I'd like to query
    birth_degree: &f64, // constant number
    wanted_degree: &f64, // another constant number
    orb: &f64, // another constant number
    upper_date_limit: DateTime<Utc>, // casts to SQL-like string
    lower_date_limit: DateTime<Utc>, // casts to SQL-like string
) -> String {
    let parsed_upper_date_limit = CelestialBodyPosition::parse_date(upper_date_limit);
    let parsed_lower_date_limit = CelestialBodyPosition::parse_date(lower_date_limit);
    return format!("
    SELECT *,(SECOND_LAG>60 OR SECOND_LAG IS NULL) AS TRANSIT_START, (SECOND_LEAD > 60 OR SECOND_LEAD IS NULL) AS TRANSIT_END, time FROM (
        SELECT 
          *, 
          UNIX_TIME - LAG(UNIX_TIME,1) OVER (ORDER BY time) as SECOND_LAG,
          LEAD(UNIX_TIME,1) OVER (ORDER BY time) - UNIX_TIME as SECOND_LEAD FROM (
            SELECT {select_param}, 
              DATE(time) as day_scoped_date,
              CAST(strftime('%s', time) AS INT) AS UNIX_TIME,
              longitude
              FROM {table_name} 
                  WHERE  ((-{orb} <= abs(realModulo(longitude -{birth_degree} -{wanted_degree},360)) 
                  AND abs(realModulo(longitude -{birth_degree} -{wanted_degree},360)) <= {orb})
                  OR
                  (-{orb} <= abs(realModulo(longitude -{birth_degree} +{wanted_degree},360)) 
                  AND abs(realModulo(longitude -{birth_degree} +{wanted_degree},360)) <= {orb}))
                  AND time < '{parsed_upper_date_limit}' AND time > '{parsed_lower_date_limit}'
            )
        ) WHERE (TRANSIT_START AND NOT TRANSIT_END) OR (TRANSIT_END AND NOT TRANSIT_START) ;
    ");
}
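If the merge route is taken, it can be done entirely in SQL. Below is a minimal sketch, not a tested migration: the table and column names follow the structure shown above (note the question's query string refers to the timestamp column as `time`, while the schema lists `date`), and the index is an assumption about what the range predicate needs.

```sql
-- Hypothetical one-off migration: collapse the per-year tables into one.
CREATE TABLE Mars (
    id        INTEGER,
    date      TEXT,
    longitude REAL
);

INSERT INTO Mars SELECT id, date, longitude FROM Mars_2000;
INSERT INTO Mars SELECT id, date, longitude FROM Mars_2001;
-- ...one INSERT per year table...

-- Without an index every query scans the whole merged table; with it,
-- the date-range predicate narrows the scan to the requested window.
CREATE INDEX idx_mars_date ON Mars(date);
```

Whether the merged table actually wins then depends far more on that index than on the merge itself: 50 indexed range scans over small tables and one indexed range scan over a big table cost roughly the same, but one unindexed scan over 20 GB is worse than 50 small ones.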
  • Can you explain a little more about what your "certain task" is, it might help us provide an answer. As it is, we can't say whether or not your suggested solution is a good way forward or not. – Cassandra S. Jun 30 '22 at 06:11
  • This is not an appropriate use of SQLite. SQLite is designed primarily for convenience, not performance. If you have a multi-gigabyte dataset, then you need to be using a real database server product, like MySQL or Postgres. Remember that SQLite doesn't even use real data types; everything is stored as a string. – Tim Roberts Jun 30 '22 at 06:18
  • @TimRoberts Unfortunately, this is a local application, so I have no option to use a fully fledged database server. @CassandraS. Well, I have to check a condition on every single record. If it matches I'd like to select the record, otherwise I'd like to skip it. I can provide my SQL but it's quite messy. The case is: I'm selecting rows that match my condition, for every table. – Mateusz Pydych Jun 30 '22 at 06:20
  • @TimRoberts If there is a better candidate for a local database I'd love to know it. I haven't found any better choice than SQLite. Most of the local databases I've found are in-memory ones like NeDB, which are not suitable, as the database itself is huge – Mateusz Pydych Jun 30 '22 at 06:34
  • And what's the condition? – Cassandra S. Jun 30 '22 at 06:54
  • @CassandraS. see the SQL I've attached to the question. Basically: 1. the `time` value from a row should be in range 2. the `longitude` value is used to calculate some sort of constant that should be in the range between -`orb` and `orb` (in most cases between -5 and 5). I want to stress that the query itself doesn't seem to be the problem; rather, the number of queries makes the whole process drag on. – Mateusz Pydych Jun 30 '22 at 07:01
  • @TimRoberts *Remember that SQLite doesn't even use real data types; everything is stored as a string.* Not true. SQLite has string, integer, double, blob, and null types and different ways to store each internally. See https://www.sqlite.org/datatype3.html and https://www.sqlite.org/fileformat2.html#record_format – Shawn Jun 30 '22 at 08:59
  • You can run MySQL on your local system. You would be happier. – Tim Roberts Jun 30 '22 at 18:39

1 Answer


I solved the issue programmatically. The whole thing was done in Rust with the r2d2_sqlite library. I'm still doing a lot of queries, but now they're done in threads, which allowed me to reduce execution time from 25 s to around 3 s. Here's the code:

use std::sync::mpsc;
use std::thread;
use r2d2_sqlite::SqliteConnectionManager;
use r2d2;

    let manager = SqliteConnectionManager::file("db_path");
    let pool = r2d2::Pool::builder().build(manager).unwrap();
    let mut result: Vec<CelestialBodyPosition> = vec![]; // Vector of structs
    let (tx, rx) = mpsc::channel(); // allows asynchronous communication
    let mut children = vec![]; // vector of join handles (not sure if needed at all)

    for query in queries {
        let pool = pool.clone(); // each iteration clones the pool handle to the database
        let inner_tx = tx.clone(); // and the sender, as each thread should have a separate one
        children.push(thread::spawn(move || {
            let conn = pool.get().unwrap();
            add_real_modulo_function(&conn); // this adds the custom SQLite function I need
            let mut sql =  conn.prepare(&query).unwrap();
            // this does query, and maps result to my internal type
            let positions: Vec<CelestialBodyPosition> = sql 
                .query_map(params![], |row| {
                    Ok(CelestialBodyPosition::new(row.get(1)?, row.get(2)?))
                })
                .unwrap()
                .map(|position| position.unwrap())
                .collect();
                // this sends partial result to receiver
                return inner_tx.send(positions).unwrap();
        }));
    }
    // the original sender has to be dropped, otherwise the receive loop below never terminates
    drop(tx);
    for received in rx {
        result.extend(received); // combine all results
    }
    return result;

As you can see, no optimization happened on the SQLite side, which kind of makes me feel I'm doing something wrong, but for now it's alright. It might be good to exert some more control over the number of spawned threads.