5

I asked two related questions (How can I speed up fetching the results after running an sqlite query? and Is it normal that sqlite.fetchall() is so slow?). I've changed some things around and got some speed-up, but the select statement still takes over an hour to finish.

I've got a table feature which contains rtMin, rtMax, mzMin and mzMax values. Together these values are the corners of a rectangle (if you read my old questions: I now save these values separately instead of getting the min() and max() from the convexhull table, which works faster).
I also have a table spectrum with an rt and an mz value, and a table which links features to spectra whenever the rt and mz values of the spectrum fall inside the rectangle of the feature.

To do this I use the following sql and python code to retrieve the ids of the spectrum and feature:

self.cursor.execute("SELECT spectrum_id, feature_table_id "+
                    "FROM `spectrum` "+
                    "INNER JOIN `feature` "+
                    "ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
                    "WHERE spectrum.scan_start_time >= feature.rtMin "+
                    "AND spectrum.scan_start_time <= feature.rtMax "+
                    "AND spectrum.base_peak_mz >= feature.mzMin "+
                    "AND spectrum.base_peak_mz <= feature.mzMax")        
spectrumAndFeature_ids = self.cursor.fetchall()

for spectrumAndFeature_id in spectrumAndFeature_ids:
    spectrum_has_feature_inputValues = (spectrumAndFeature_id[0], spectrumAndFeature_id[1])
    self.cursor.execute("INSERT INTO `spectrum_has_feature` VALUES (?,?)", spectrum_has_feature_inputValues)

I timed the execute, fetchall and insert times and got the following:

query took: 74.7989799976 seconds
5888.845541 seconds since fetchall
returned a length of: 10822
inserting all values took: 3.29669690132 seconds

So this query takes about an hour and a half, with most of that time spent in fetchall(). How can I speed this up? Should I do the rt and mz comparisons in Python code instead?


Update:

To show what indexes I got, here are the create statements for the tables:

CREATE  TABLE IF NOT EXISTS `feature` (
  `feature_table_id` INT PRIMARY KEY NOT NULL ,
  `feature_id` VARCHAR(40) NOT NULL ,
  `intensity` DOUBLE NOT NULL ,
  `overallquality` DOUBLE NOT NULL ,
  `charge` INT NOT NULL ,
  `content` VARCHAR(45) NOT NULL ,
  `intensity_cutoff` DOUBLE NOT NULL,
  `mzMin` DOUBLE NULL ,
  `mzMax` DOUBLE NULL ,
  `rtMin` DOUBLE NULL ,
  `rtMax` DOUBLE NULL ,
  `msrun_msrun_id` INT NOT NULL ,
  CONSTRAINT `fk_feature_msrun1`
    FOREIGN KEY (`msrun_msrun_id` )
    REFERENCES `msrun` (`msrun_id` )
    ON DELETE NO ACTION
    ON UPDATE NO ACTION);

  CREATE UNIQUE INDEX `id_UNIQUE` ON `feature` (`feature_table_id` ASC);
  CREATE INDEX `fk_feature_msrun1` ON `feature` (`msrun_msrun_id` ASC);



CREATE  TABLE IF NOT EXISTS `spectrum` (
  `spectrum_id` INT PRIMARY KEY NOT NULL ,
  `spectrum_index` INT NOT NULL ,
  `ms_level` INT NOT NULL ,
  `base_peak_mz` DOUBLE NOT NULL ,
  `base_peak_intensity` DOUBLE NOT NULL ,
  `total_ion_current` DOUBLE NOT NULL ,
  `lowest_observes_mz` DOUBLE NOT NULL ,
  `highest_observed_mz` DOUBLE NOT NULL ,
  `scan_start_time` DOUBLE NOT NULL ,
  `ion_injection_time` DOUBLE,
  `binary_data_mz` BLOB NOT NULL,
  `binaray_data_rt` BLOB NOT NULL,
  `msrun_msrun_id` INT NOT NULL ,
  CONSTRAINT `fk_spectrum_msrun1`
    FOREIGN KEY (`msrun_msrun_id` )
    REFERENCES `msrun` (`msrun_id` )
    ON DELETE NO ACTION
    ON UPDATE NO ACTION);

  CREATE INDEX `fk_spectrum_msrun1` ON `spectrum` (`msrun_msrun_id` ASC);



CREATE  TABLE IF NOT EXISTS `spectrum_has_feature` (
  `spectrum_spectrum_id` INT NOT NULL ,
  `feature_feature_table_id` INT NOT NULL ,
  CONSTRAINT `fk_spectrum_has_feature_spectrum1`
    FOREIGN KEY (`spectrum_spectrum_id` )
    REFERENCES `spectrum` (`spectrum_id` )
    ON DELETE NO ACTION
    ON UPDATE NO ACTION,
  CONSTRAINT `fk_spectrum_has_feature_feature1`
    FOREIGN KEY (`feature_feature_table_id` )
    REFERENCES `feature` (`feature_table_id` )
    ON DELETE NO ACTION
    ON UPDATE NO ACTION);

  CREATE INDEX `fk_spectrum_has_feature_feature1` ON `spectrum_has_feature` (`feature_feature_table_id` ASC);
  CREATE INDEX `fk_spectrum_has_feature_spectrum1` ON `spectrum_has_feature` (`spectrum_spectrum_id` ASC);

update 2:

I have 20938 spectra, 305742 features and 2 msruns. It results in 10822 matches.


update 3:

Using a new index (CREATE INDEX fk_spectrum_msrun1_2 ON spectrum (msrun_msrun_id, base_peak_mz);) and between saves about 20 seconds:

query took: 76.4599349499 seconds
5864.15418601 seconds since fetchall


Update 4:

Print from EXPLAIN QUERY PLAN:

(0, 0, 0, u'SCAN TABLE spectrum (~1000000 rows)')
(0, 1, 1, u'SEARCH TABLE feature USING INDEX fk_feature_msrun1 (msrun_msrun_id=?) (~2 rows)')
Niek de Klein

7 Answers

6

You are correlating two big tables. Some quick math: 300k x 20k = 6 billion row combinations. If it were just a matter of returning all those rows, you would surely be I/O bound (but really only on the (O)utput side). However, your WHERE clause filters out almost everything; only about 10k rows are returned, so you are for sure CPU bound here.

SQLite can't use more than one index at a time, with the exception of what is called the "OR optimizations". Furthermore, you don't get any performance gain from inner joins as they "are converted into additional terms of the WHERE clause".

Bottom line is that SQLite will not be able to execute your query as efficiently as say postgresql et al.

I played around with your scenario as I was curious to see by how much your query can be optimized. Ultimately, it seems that the best optimization is to remove all explicit indices (!). It seems that SQLite figures out some on-the-fly index/indices that result in better performance than the different approaches I tried.

As a demonstration, consider this schema derived from yours:

CREATE TABLE feature ( -- 300k
    feature_id INTEGER PRIMARY KEY,
    mzMin DOUBLE,
    mzMax DOUBLE,
    rtMin DOUBLE,
    rtMax DOUBLE,
    lnk_feature INT);
CREATE TABLE spectrum ( -- 20k
    spectrum_id INTEGER PRIMARY KEY,
    mz DOUBLE,
    rt DOUBLE,
    lnk_spectrum INT);

feature has 300k rows, and spectrum 20k (the python code that does this is somewhere below). There is no explicit index specified, only implicit ones due to the definition INTEGER PRIMARY KEY:

INTEGER PRIMARY KEY columns aside, both UNIQUE and PRIMARY KEY constraints are implemented by creating an index in the database (in the same way as a "CREATE UNIQUE INDEX" statement would). Such an index is used like any other index in the database to optimize queries. As a result, there is often no advantage (but significant overhead) in creating an index on a set of columns that are already collectively subject to a UNIQUE or PRIMARY KEY constraint.

Using the schema above, SQLite mentions it would create an index during the life of the query on lnk_feature:

sqlite> EXPLAIN QUERY PLAN SELECT feature_id, spectrum_id FROM spectrum, feature
   ...> WHERE lnk_feature = lnk_spectrum
   ...>     AND rt >= rtMin AND rt <= rtMax
   ...>     AND mz >= mzMin AND mz <= mzMax;
0|0|0|SCAN TABLE spectrum (~20000 rows)
0|1|1|SEARCH TABLE feature USING AUTOMATIC COVERING INDEX (lnk_feature=?) (~7 rows)

And even though I tested with an index on that column, or other columns, it seems that the fastest way to run that query is without any of those indices.

The fastest I ran the query above using python is 20 minutes. This includes the completion of .fetchall(). You mention that you will have 150 times more rows at some point. I'd start looking into postgresql if I were you ;-)... Note that you could split the work in threads and potentially divide the time to complete the query by the number of threads that will be able to run concurrently (i.e. by the number of CPUs available).

In any case, here is the code I used. Can you run it yourself and report back how fast the query ran in your environment. Note that I am using apsw, so if you can't use it you'll need to tweak to use your own sqlite3 module.

#!/usr/bin/python
import apsw, random as rand, time

def populate(cu):
    cu.execute("""
CREATE TABLE feature ( -- 300k
    feature_id INTEGER PRIMARY KEY,
    mzMin DOUBLE, mzMax DOUBLE,
    rtMin DOUBLE, rtMax DOUBLE,
    lnk_feature INT);
CREATE TABLE spectrum ( -- 20k
    spectrum_id INTEGER PRIMARY KEY,
    mz DOUBLE, rt DOUBLE,
    lnk_spectrum INT);""")
    cu.execute("BEGIN")
    for i in range(300000):
        ((mzMin, mzMax), (rtMin, rtMax)) = (get_min_max(), get_min_max())
        cu.execute("INSERT INTO feature VALUES (NULL,%s,%s,%s,%s,%s)" 
                    % (mzMin, mzMax, rtMin, rtMax, get_lnk()))
    for i in range(20000):
        cu.execute("INSERT INTO spectrum VALUES (NULL,%s,%s,%s)"
                    % (get_in_between(), get_in_between(), get_lnk()))
    cu.execute("COMMIT")
    cu.execute("ANALYZE")

def get_lnk():
    return rand.randint(1, 2)

def get_min_max():
    return sorted((rand.normalvariate(0.5, 0.004), 
                   rand.normalvariate(0.5, 0.004)))

def get_in_between():
    return rand.normalvariate(0.5, 0.49)

def select(cu):
    sql = """
    SELECT feature_id, spectrum_id FROM spectrum, feature
    WHERE lnk_feature = lnk_spectrum
        AND rt >= rtMin AND rt <= rtMax
        AND mz >= mzMin AND mz <= mzMax"""
    start = time.time()
    cu.execute(sql)
    print ("%s rows; %.2f seconds" % (len(cu.fetchall()), time.time() - start))

cu = apsw.Connection('foo.db').cursor()
populate(cu)
select(cu)

Output I get:

54626 rows; 1210.96 seconds
  • My output is: 4675 rows; 1463.49 seconds, I will try it out on my real script now. Thanks! – Niek de Klein May 07 '12 at 08:48
  • My script is now down to: query took: 100.947947025 seconds fetchall took: 1485.43059111 seconds Still not as fast as I wanted it, but a lot better! Thanks! Wish I could give more vote-ups. – Niek de Klein May 07 '12 at 12:26
2

Make it better on the SQL side.

In a word: use indexes!

aF.
1

Use between instead of >= and <= for the range comparisons.

self.cursor.execute("SELECT spectrum_id, feature_table_id "+
                        "FROM `spectrum` "+
                        "INNER JOIN `feature` "+
                        "ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
                        "WHERE spectrum.scan_start_time between feature.rtMin " + 
                        "AND feature.rtMax "+
                        "AND spectrum.base_peak_mz between feature.mzMin "+
                        "AND feature.mzMax")   

You can create non-clustered indexes on the spectrum.scan_start_time, feature.rtMin, feature.rtMax, spectrum.base_peak_mz, feature.mzMin and feature.mzMax fields.

Romil Kumar Jain
1

A normal RDBMS would do a hash join between the spectrum and feature tables. If you can force it to do a hash join, the query should fly.

However, can you try it as a single query?

self.cursor.execute("INSERT INTO `spectrum_has_feature` " + 
                    "SELECT spectrum_id, feature_table_id "+
                    "FROM `spectrum` "+
                    "INNER JOIN `feature` "+
                    "ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
                    "WHERE spectrum.scan_start_time >= feature.rtMin "+
                    "AND spectrum.scan_start_time <= feature.rtMax "+
                    "AND spectrum.base_peak_mz >= feature.mzMin "+
                    "AND spectrum.base_peak_mz <= feature.mzMax")  
Florin Ghita
1

I ran Ludo's script and it reported 1451 seconds on my system. I then added the following index, which cut the time to 875 seconds (40% decrease):

CREATE INDEX idx1 ON feature (lnk_feature, mzMin, mzMax, rtMin, rtMax);

Still not blindingly fast, but better. Here's the output of EXPLAIN QUERY PLAN:

0|0|0|SCAN TABLE spectrum (~20000 rows)
0|1|1|SEARCH TABLE feature USING COVERING INDEX idx1 (lnk_feature=? AND mzMin<?) (~7 rows)

Note that this is a covering index, which means that SQLite can read all the fields it needs from the index and therefore never has to read from the feature table. It is also able to use the index for two of the five conditions in the WHERE clause (instead of only one).

sendai
0

Change this

CREATE INDEX `fk_spectrum_msrun1` ON `spectrum` (`msrun_msrun_id` ASC);

to one of these (whichever is more selective)

CREATE INDEX `fk_spectrum_msrun1_1` ON `spectrum` (`msrun_msrun_id`, `scan_start_time`);
CREATE INDEX `fk_spectrum_msrun1_2` ON `spectrum` (`msrun_msrun_id`, `base_peak_mz`);

The first one may be able to speed up the comparisons on scan_start_time and the second one the comparisons on base_peak_mz. Because these are inequality comparisons, an index on both columns is not useful.

Janne Karila
0

1. Use between instead of <= and >=.

2. Add indexes on scan_start_time and base_peak_mz.