I've written a Ruby script to do the following:

  1. Read a very large (2GB/12,500,000 lines) CSV into SQLite3
  2. Query the db
  3. Output results to a new CSV

In my mind, this seems to be the easiest and most logical way to go about it. This process will need to be configurable and repeated periodically, hence the script. I'm using SQLite because the data will always come in CSV form (no access to the original DB) and it's just easier to offload the processing to an (easily changeable) SQL statement.

The problem is that steps 1 and 2 take such a long time. I've searched for ways to improve the performance of SQLite and have implemented some of the suggestions, with limited success (a condensed sketch of how they fit together follows the list):

  • In-memory instance of SQLite3
  • Use a transaction (around step 1)
  • Use a prepared statement
  • PRAGMA synchronous = OFF
  • PRAGMA journal_mode = MEMORY (not sure if this helps when using an in-memory DB)
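
For reference, here is the condensed sketch of how those settings combine with the sqlite3 gem (this is not the real script, which appears further down; the table name and data are placeholders):

require 'sqlite3'

# In-memory database; the synchronous/journal_mode pragmas matter more for
# on-disk databases but are shown as used in the full script below.
db = SQLite3::Database.new ":memory:"
db.execute "PRAGMA synchronous = OFF"
db.execute "PRAGMA journal_mode = MEMORY"

db.execute "CREATE TABLE Sample (Name TEXT, Value REAL)"
rows = [["a", 1.0], ["b", 2.0]]   # placeholder data

# One transaction around the whole bulk insert, and a prepared statement so
# the INSERT is parsed only once.
stmt = db.prepare "INSERT INTO Sample VALUES (?1, ?2)"
db.transaction do
  rows.each { |r| stmt.execute r[0], r[1] }
end
stmt.close
db.close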

After all these, I get the following times:

  • Read time: 17m 28s
  • Query time: 14m 26s
  • Write time: 0m 4s
  • Elapsed time: 31m 58s

Granted, I'm using a different language from the one in the post mentioned above, and there are differences such as compiled vs. interpreted. Still, the insert rates are approximately 79,000 vs. 12,000 records/second (12,500,000 rows in roughly 17.5 minutes works out to about 11,900 rows/second), which makes mine about 6x slower.

I've also tried indexing some (or all) of the fields. This actually has the opposite effect: the indexing takes so long that any improvement in query time is completely overshadowed by the index-creation time. Additionally, doing that with the in-memory DB eventually leads to an out-of-memory error because of the extra space the indexes require.

Is SQLite3 not the right DB for this amount of data? I've tried the same using MySQL, but its performance was even worse.

Lastly, here's a chopped down version of the code (some irrelevant niceties removed).

require 'csv'
require 'sqlite3'

inputFile = ARGV[0]
outputFile = ARGV[1]
criteria1 = ARGV[2]
criteria2 = ARGV[3]
criteria3 = ARGV[4]

begin
    memDb = SQLite3::Database.new ":memory:"
    memDb.execute "PRAGMA synchronous = OFF"
    memDb.execute "PRAGMA journal_mode = MEMORY"

    memDb.execute "DROP TABLE IF EXISTS Area"
    memDb.execute "CREATE TABLE IF NOT EXISTS Area (StreetName TEXT, StreetType TEXT, Locality TEXT, State TEXT, PostCode INTEGER, Criteria1 REAL, Criteria2 REAL, Criteria3 REAL)" 
    insertStmt = memDb.prepare "INSERT INTO Area VALUES(?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8)"

    # Read values from file
    readCounter = 0
    memDb.execute "BEGIN TRANSACTION"
    blockReadTime = Time.now
    CSV.foreach(inputFile) { |line|

        readCounter += 1
        break if readCounter > 100000 # NB: caps the load at 100,000 lines
        if readCounter % 10000 == 0
            formattedReadCounter = readCounter.to_s.reverse.gsub(/...(?=.)/,'\&,').reverse
            print "\rReading line #{formattedReadCounter} (#{Time.now - blockReadTime}s)     " 
            STDOUT.flush
            blockReadTime = Time.now
        end

        insertStmt.execute (line[6]||"").gsub("'", "''"), (line[7]||"").gsub("'", "''"), (line[9]||"").gsub("'", "''"), line[10], line[11], line[12], line[13], line[14]
    }
    memDb.execute "END TRANSACTION"
    insertStmt.close

    # Process values
    sqlQuery = <<eos
    SELECT DISTINCT
        '*',
        '*',
        Locality,
        State,
        PostCode
    FROM
        Area
    GROUP BY
        Locality,
        State,
        PostCode
    HAVING
        MAX(Criteria1) <= #{criteria1}
        AND
        MAX(Criteria2) <= #{criteria2}
        AND
        MAX(Criteria3) <= #{criteria3}
    UNION
    SELECT DISTINCT
        StreetName,
        StreetType,
        Locality,
        State,
        PostCode
    FROM
        Area
    WHERE
        Locality NOT IN (
            SELECT
                Locality
            FROM
                Area
            GROUP BY
                Locality
            HAVING
                MAX(Criteria1) <= #{criteria1}
                AND
                MAX(Criteria2) <= #{criteria2}
                AND
                MAX(Criteria3) <= #{criteria3}
            )
    GROUP BY
        StreetName,
        StreetType,
        Locality,
        State,
        PostCode
    HAVING
        MAX(Criteria1) <= #{criteria1}
        AND
        MAX(Criteria2) <= #{criteria2}
        AND
        MAX(Criteria3) <= #{criteria3}
eos
    statement = memDb.prepare sqlQuery

    # Output to CSV
    csvFile = CSV.open(outputFile, "wb")
    resultSet = statement.execute
    resultSet.each { |row|  csvFile << row}
    csvFile.close

rescue SQLite3::Exception => ex
    puts "Excepion occurred: #{ex}"
ensure
    statement.close if statement
    memDb.close if memDb
end
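
For completeness, the script takes the input CSV, the output CSV and the three criteria thresholds as arguments; assuming it were saved as filter_areas.rb (a made-up name), it would be invoked like this, with arbitrary example thresholds:

ruby filter_areas.rb input.csv output.csv 5.0 3.0 1.0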

Please feel free to poke fun at my naive Ruby coding - what doesn't kill me will hopefully make me a stronger coder.

maccaroo
  • Might be my lack of Ruby skills, but you do not seem to be using batch insert? – Mirko Adari Feb 01 '13 at 06:04
  • Have you tried something like this? [MySQL CSV Storage Engine](http://dev.mysql.com/doc/refman/5.0/en/csv-storage-engine.html) – Zach Kemp Feb 01 '13 at 06:04
  • @Mirko Could you clarify what you mean by batch insert? The only way that I can see to push the data into SQLite3 is by using insert statements. Also, from what I've read, even using the "INSERT INTO targetTable (fields) VALUES (values)[, (values)]*" format doesn't improve efficiency – maccaroo Feb 01 '13 at 06:24
  • @Zach I tried MySQL to compare performance, though not the CSV storage engine. I need a solution which is portable - i.e. I need to be able to give the script to anyone, without the need for them to install other software. That's why the SQLite3 in-memory solution was chosen. – maccaroo Feb 01 '13 at 06:28
  • How slow is index creation? – CL. Feb 01 '13 at 08:14
  • Adding an index to a single field takes about 210s. Btw, my machine has a SSD, so it's no slouch, although that wouldn't really affect the index creation, which would be in memory only. – maccaroo Feb 01 '13 at 08:39
  • Apologies for chiming in weeks late, but I can't help thinking you're using the wrong tools to solve this problem. The read/write profile here is obviously tremendously skewed towards insertion, and this is borne out by your benchmarks. Why not consider a tool, such as Redis, that is designed for such a profile? Granted, you'd need to learn a different querying language/paradigm and also figure out how to start/stop the Redis server from your script, but I suspect that would be worth it if you could get the whole process down to a couple of minutes. – exbinary Feb 12 '13 at 16:45
  • And just to push my earlier comment a little further, once you go down the road of keeping things in-memory, you might find that performing all the processing in Ruby (without a DB) is the fastest option of all. I realize you want to be able to change the query easily, but a good deal of flexibility could be built into the script with a few generic grouping and filtering functions. Anyway, just food for thought. – exbinary Feb 12 '13 at 16:47
  • @an4rcho Thanks for the Redis suggestion. I'll try to look into it. Ruby was chosen because I'm vaguely familiar with it and it's extremely easy to knock up a functioning app. The app was a 'quick win' and (in its current form) will likely only be used once a month. I'm not sure what future versions will require. – maccaroo Feb 13 '13 at 02:23
  • Yeah, i also default to ruby for these types of problems - it's more the choice of DB that i was questioning. Redis works great with ruby. Further, ruby has such great built-in support for working with collections that i'd also look into doing all your processing in memory without a DB. I've done this sort of csv munging in the past so let me know if you want some code to look at. Cheers. – exbinary Feb 13 '13 at 02:37

1 Answer


In general, you should try UNION ALL instead of UNION, if possible, so that the two subqueries do not have to be checked for duplicates. However, in this case, SQLite then has to execute the DISTINCT in a separate step. Whether this is faster or not depends on your data.
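
For anyone unsure of the distinction, here is a minimal toy illustration (not the query from the question): UNION removes duplicate rows produced by the two arms, while UNION ALL simply concatenates them.

require 'sqlite3'

db = SQLite3::Database.new ":memory:"
p db.execute "SELECT 1 UNION SELECT 1"       # => [[1]]        duplicate removed
p db.execute "SELECT 1 UNION ALL SELECT 1"   # => [[1], [1]]   both rows kept
db.close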

According to my EXPLAIN QUERY PLAN experiments, the following two indexes should help most with this query:

CREATE INDEX i1 ON Area(Locality, State, PostCode);
CREATE INDEX i2 ON Area(StreetName, StreetType, Locality, State, PostCode);
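
In the script from the question, these could be created right after the bulk load finishes and before the query is prepared, so that the inserts themselves are not slowed down by index maintenance. A sketch, assuming the memDb handle from the question:

# After "END TRANSACTION" / insertStmt.close, before preparing sqlQuery
memDb.execute "CREATE INDEX i1 ON Area(Locality, State, PostCode)"
memDb.execute "CREATE INDEX i2 ON Area(StreetName, StreetType, Locality, State, PostCode)"
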
CL.
  • Firstly, thanks for mentioning the DISTINCT - I've removed it since it's made redundant by the GROUP BY. Also, I've used the UNION ALL to remove the implied DISTINCT, as that's not necessary. Finally, I'm giving it a run with the multi-field indexes now. Previously I was using a number of single field indexes... which shows my inexperience in SQL optimisations. Will let you know how that goes. – maccaroo Feb 01 '13 at 08:55
  • Here are the new numbers with CL's suggested indexes: **CREATE INDEX i1 (233.44s)**, **CREATE INDEX i2 (238.836s)**, **Read time: 17m 40s**, **Index time: 7m 52s**, **Query time: 0m 54s**, **Write time: 0m 3s**, **Elapsed time: 26m 31s**. Wow - what a difference those indexes made! I had no idea that was the correct way to create them. Previously, out of desperation, I had created 8 separate indexes (1 for each field), but that just chewed time creating them, with no benefit to the query. – maccaroo Feb 01 '13 at 09:06