
I am using SQLite in Python for a big file management system. I have a big flat file (100 million lines) that I want to sort by the values of 3 columns (which are integers), so that I can then iterate over it and do some computation.

I used SQLite with a big SELECT ... ORDER BY (with an index on one column). Since this big SELECT is too memory-demanding, I need to call it several times (with OFFSET and LIMIT).

I could use Linux sort, but I want the solution to be platform independent. It works fine (as long as the right PRAGMAs are set), but it is slow. How can I optimize this?

The commands look like this:

PRAGMA journal_mode = OFF
PRAGMA synchronous = 0
PRAGMA locking_mode = EXCLUSIVE
PRAGMA count_changes = OFF
PRAGMA temp_store = 2
CREATE TABLE tmpTranscripts_arm_3R_transcripts (id INTEGER PRIMARY KEY, name varchar(255), chromosome varchar(255), start int(11), end int(11), direction tinyint(4), tags varchar(1023), bin int(11), exons varchar(10000))
CREATE INDEX 'iTranscript_arm_3R_14943' ON 'tmpTranscripts_arm_3R_transcripts' (start, end, direction)
INSERT INTO tmpTranscripts_arm_3R_transcripts (name, chromosome, start, end, direction, tags, bin, exons) VALUES ('SRR060644.1', 'arm_3R', 11450314, 11450337, -1, 'feature=transcript;bestRegion=(self);nbGaps=0;nbMismatches=0;ID=SRR060644.1;identity=100.0', 300011450, '')
(this, more than 10 million times)
SELECT * FROM tmpTranscripts_arm_3R_transcripts ORDER BY start, end, direction LIMIT 0, 10000
(this, as many times as needed)
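In Python, the paging loop described above might look roughly like this (a sketch only; the database filename and the batch size of 10,000 are assumptions based on the commands shown):

import sqlite3

conn = sqlite3.connect('transcripts.db')  # assumed filename
c = conn.cursor()

batch_size = 10000
offset = 0
while True:
    # re-run the ORDER BY query for each page
    c.execute("SELECT * FROM tmpTranscripts_arm_3R_transcripts "
              "ORDER BY start, end, direction LIMIT ? OFFSET ?",
              (batch_size, offset))
    rows = c.fetchall()
    if not rows:
        break
    for row in rows:
        pass  # do some computation on each row
    offset += batch_size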
unamourdeswann
  • Have you created INDEXes for these 3 columns? Adding indexes really increases the speed of sort operations. But of course it has its price - the database will be larger and insert operations will be slower. – Zuljin Oct 05 '11 at 09:48
  • If 3 separate indexes don't work well, then maybe it is possible to create 1 multi-column index, if your SELECT always orders in the same way. Check this answer http://stackoverflow.com/questions/179085/multiple-indexes-vs-multi-column-indexes – Zuljin Oct 05 '11 at 10:07
  • Well, it seems it did not work much better... It would need ~8 hours to iterate over "only" 10 million lines... and I tried both solutions... Another idea? – unamourdeswann Oct 05 '11 at 15:48

1 Answer


I have written a sample script that creates your database and iterates over all of its elements, and it seems to run much faster than what you wrote in the comments. Are you sure that database access is the bottleneck? Maybe your script does something more, and that is what takes so much time.

I have checked 2 databases, SQLite and MongoDB, with 5 million items. For SQLite, inserting all rows took ~1200 seconds and selecting them took around 300 seconds. MongoDB was faster: inserting took ~400 seconds and selecting less than 100 seconds.

Please compare your code with my samples and check whether your select is similar. I used a cursor instead of LIMIT/OFFSET. If this still doesn't help, then I think MongoDB is worth a shot. It has one disadvantage: it requires a 64-bit OS to support large databases (like yours). If you have never used it before, here is the shortest installation guide for Windows:

And here are my Python 3.x test scripts, first for SQLite

import sqlite3
from time import time

conn = sqlite3.connect('test.dbase')

c = conn.cursor()

c.execute("""PRAGMA journal_mode = OFF""")
c.execute("""PRAGMA synchronous = 0""")
c.execute("""PRAGMA locking_mode = EXCLUSIVE""")
c.execute("""PRAGMA count_change = OFF""")
c.execute("""PRAGMA temp_store = 2""")

c.execute("""CREATE TABLE tmpTranscripts_arm_3R_transcripts (id INTEGER PRIMARY KEY, name varchar(255), chromosome varchar(255), start int(11), end int(11), direction tinyint(4), tags varchar(1023), bin int(11), exons varchar(10000))""")
c.execute("""CREATE INDEX 'iTranscript_arm_3R_14943' ON 'tmpTranscripts_arm_3R_transcripts' (start, end, direction)""")

t1 = time()

# insert 5 million rows with pseudo-random start/end/direction values
for i in range(0, 5000000):
    c.execute("""INSERT INTO tmpTranscripts_arm_3R_transcripts (name, chromosome, start, end, direction, tags, bin, exons) VALUES ('SRR060644.1', 'arm_3R', %d, %d, %d, 'feature=transcript;bestRegion=(self);nbGaps=0;nbMismatches=0;ID=SRR060644.1;identity=100.0', 300011450, '')""" % ((i+123)%352, (i+523)%422, (i+866)%536))
    if(not i%10000):
        print("Insert:", i)

t2 = time()
print("Insert time", t2-t1)

conn.commit()

t1 = time()
c.execute("""SELECT * FROM tmpTranscripts_arm_3R_transcripts ORDER BY start, end, direction""")

i = 0
for row in c:
    a = row[0]
    if(not i%10000):
        print("Get:", i, row)
    i+=1

t2 = time()
print("Sort time", t2-t1)

c.close()
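As a side note (not part of the script above), if the insert phase is what hurts, one common SQLite pattern is a parameterized executemany call, so the statement is compiled only once; a rough sketch against the same table:

import sqlite3

conn = sqlite3.connect('test.dbase')
c = conn.cursor()

# generator of (start, end, direction) tuples matching the test data above
rows = (((i+123) % 352, (i+523) % 422, (i+866) % 536) for i in range(5000000))

c.executemany("""INSERT INTO tmpTranscripts_arm_3R_transcripts
                 (name, chromosome, start, end, direction, tags, bin, exons)
                 VALUES ('SRR060644.1', 'arm_3R', ?, ?, ?,
                         'feature=transcript;bestRegion=(self);nbGaps=0;nbMismatches=0;ID=SRR060644.1;identity=100.0',
                         300011450, '')""", rows)
conn.commit()
c.close()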

and for MongoDB

from pymongo import Connection
from pymongo import ASCENDING, DESCENDING
from time import time

connection = Connection('localhost', 27017)
db = connection['test-database']
posts = db.posts

posts.create_index([("start", ASCENDING), ("end", ASCENDING), ("direction", ASCENDING)])

t1 = time()

# insert 5 million documents with the same pseudo-random key values
for i in range(0, 5000000):
    post = { "name": 'SRR060644.1',
            "chromosome": 'arm_3R',
            "start": (i+123)%352,
            "end": (i+523)%422,
            "direction": (i+866)%536,
            "tags": 'feature=transcript;bestRegion=(self);nbGaps=0;nbMismatches=0;ID=SRR060644.1;identity=100.0',
            "bin": 300011450,
            "exons": ''}

    posts.insert(post)

    if(not i%10000):
        print("Insert:", i)

t2 = time()
print("Insert time", t2-t1)

t1 = time()

i = 0
# iterate all documents in sorted order via the compound index
for post in posts.find().sort([("start", ASCENDING), ("end", ASCENDING), ("direction", ASCENDING)]):
    if(not i%10000):
        print("Get:", i, post)
    i+=1

t2 = time()
print("Sort time", t2-t1)
Zuljin