5

I asked two related questions (How can I speed up fetching the results after running an sqlite query? and Is it normal that sqlite.fetchall() is so slow?). I've changed some things around and got some speed-up, but the select statement still takes over an hour to finish.

I've got a table feature which contains rtMin, rtMax, mzMin and mzMax values. Together these values are the corners of a rectangle (if you read my old questions: I now save these values separately instead of getting the min() and max() from the convexhull table, which works faster).
I also have a table spectrum with an rt and an mz value, and a table which links features to spectra whenever the rt and mz values of the spectrum fall inside the rectangle of the feature.

To do this I use the following sql and python code to retrieve the ids of the spectrum and feature:

self.cursor.execute("SELECT spectrum_id, feature_table_id "+
                    "FROM `spectrum` "+
                    "INNER JOIN `feature` "+
                    "ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
                    "WHERE spectrum.scan_start_time >= feature.rtMin "+
                    "AND spectrum.scan_start_time <= feature.rtMax "+
                    "AND spectrum.base_peak_mz >= feature.mzMin "+
                    "AND spectrum.base_peak_mz <= feature.mzMax")        
spectrumAndFeature_ids = self.cursor.fetchall()

for spectrumAndFeature_id in spectrumAndFeature_ids:
    spectrum_has_feature_inputValues = (spectrumAndFeature_id[0], spectrumAndFeature_id[1])
    self.cursor.execute("INSERT INTO `spectrum_has_feature` VALUES (?,?)", spectrum_has_feature_inputValues)

I timed the execute, fetchall and insert times and got the following:

query took: 74.7989799976 seconds
5888.845541 seconds since fetchall
returned a length of: 10822
inserting all values took: 3.29669690132 seconds

So this query takes about an hour and a half, with most of that time spent in fetchall(). How can I speed this up? Should I do the rt and mz comparisons in Python code instead?


Update:

To show what indexes I got, here are the create statements for the tables:

CREATE  TABLE IF NOT EXISTS `feature` (
  `feature_table_id` INT PRIMARY KEY NOT NULL ,
  `feature_id` VARCHAR(40) NOT NULL ,
  `intensity` DOUBLE NOT NULL ,
  `overallquality` DOUBLE NOT NULL ,
  `charge` INT NOT NULL ,
  `content` VARCHAR(45) NOT NULL ,
  `intensity_cutoff` DOUBLE NOT NULL,
  `mzMin` DOUBLE NULL ,
  `mzMax` DOUBLE NULL ,
  `rtMin` DOUBLE NULL ,
  `rtMax` DOUBLE NULL ,
  `msrun_msrun_id` INT NOT NULL ,
  CONSTRAINT `fk_feature_msrun1`
    FOREIGN KEY (`msrun_msrun_id` )
    REFERENCES `msrun` (`msrun_id` )
    ON DELETE NO ACTION
    ON UPDATE NO ACTION);

  CREATE UNIQUE INDEX `id_UNIQUE` ON `feature` (`feature_table_id` ASC);
  CREATE INDEX `fk_feature_msrun1` ON `feature` (`msrun_msrun_id` ASC);



CREATE  TABLE IF NOT EXISTS `spectrum` (
  `spectrum_id` INT PRIMARY KEY NOT NULL ,
  `spectrum_index` INT NOT NULL ,
  `ms_level` INT NOT NULL ,
  `base_peak_mz` DOUBLE NOT NULL ,
  `base_peak_intensity` DOUBLE NOT NULL ,
  `total_ion_current` DOUBLE NOT NULL ,
  `lowest_observes_mz` DOUBLE NOT NULL ,
  `highest_observed_mz` DOUBLE NOT NULL ,
  `scan_start_time` DOUBLE NOT NULL ,
  `ion_injection_time` DOUBLE,
  `binary_data_mz` BLOB NOT NULL,
  `binaray_data_rt` BLOB NOT NULL,
  `msrun_msrun_id` INT NOT NULL ,
  CONSTRAINT `fk_spectrum_msrun1`
    FOREIGN KEY (`msrun_msrun_id` )
    REFERENCES `msrun` (`msrun_id` )
    ON DELETE NO ACTION
    ON UPDATE NO ACTION);

  CREATE INDEX `fk_spectrum_msrun1` ON `spectrum` (`msrun_msrun_id` ASC);



CREATE  TABLE IF NOT EXISTS `spectrum_has_feature` (
  `spectrum_spectrum_id` INT NOT NULL ,
  `feature_feature_table_id` INT NOT NULL ,
  CONSTRAINT `fk_spectrum_has_feature_spectrum1`
    FOREIGN KEY (`spectrum_spectrum_id` )
    REFERENCES `spectrum` (`spectrum_id` )
    ON DELETE NO ACTION
    ON UPDATE NO ACTION,
  CONSTRAINT `fk_spectrum_has_feature_feature1`
    FOREIGN KEY (`feature_feature_table_id` )
    REFERENCES `feature` (`feature_table_id` )
    ON DELETE NO ACTION
    ON UPDATE NO ACTION);

  CREATE INDEX `fk_spectrum_has_feature_feature1` ON `spectrum_has_feature` (`feature_feature_table_id` ASC);
  CREATE INDEX `fk_spectrum_has_feature_spectrum1` ON `spectrum_has_feature` (`spectrum_spectrum_id` ASC);

update 2:

I have 20938 spectra, 305742 features and 2 msruns. It results in 10822 matches.


update 3:

Using a new index (CREATE INDEX fk_spectrum_msrun1_2 ON spectrum (msrun_msrun_id, base_peak_mz);) and between saves about 20 seconds:

query took: 76.4599349499 seconds
5864.15418601 seconds since fetchall


Update 4:

Print from EXPLAIN QUERY PLAN:

(0, 0, 0, u'SCAN TABLE spectrum (~1000000 rows)')
(0, 1, 1, u'SEARCH TABLE feature USING INDEX fk_feature_msrun1 (msrun_msrun_id=?) (~2 rows)')
Niek de Klein

7 Answers

6

You are correlating two big tables. Some quick math: 300k x 20k = 6 billion row combinations. If it were just a matter of returning all those rows, you would surely be I/O bound (but really only on the (O)utput side). However, your WHERE clause filters out almost everything; only about 10k rows are returned, so you are for sure CPU bound here.

SQLite can't use more than one index at a time, with the exception of what is called the "OR optimizations". Furthermore, you don't get any performance gain from inner joins as they "are converted into additional terms of the WHERE clause".

Bottom line is that SQLite will not be able to execute your query as efficiently as say postgresql et al.

I played around with your scenario as I was curious to see by how much your query can be optimized. Ultimately, it seems that the best optimization is to remove all explicit indices (!). It seems that SQLite figures out some on-the-fly index/indices that result in better performance than the different approaches I tried.

As a demonstration, consider this schema derived from yours:

CREATE TABLE feature ( -- 300k
    feature_id INTEGER PRIMARY KEY,
    mzMin DOUBLE,
    mzMax DOUBLE,
    rtMin DOUBLE,
    rtMax DOUBLE,
    lnk_feature INT);
CREATE TABLE spectrum ( -- 20k
    spectrum_id INTEGER PRIMARY KEY,
    mz DOUBLE,
    rt DOUBLE,
    lnk_spectrum INT);

feature has 300k rows, and spectrum 20k (the python code that does this is somewhere below). There is no explicit index specified, only implicit ones due to the definition INTEGER PRIMARY KEY:

INTEGER PRIMARY KEY columns aside, both UNIQUE and PRIMARY KEY constraints are implemented by creating an index in the database (in the same way as a "CREATE UNIQUE INDEX" statement would). Such an index is used like any other index in the database to optimize queries. As a result, there is often no advantage (but significant overhead) in creating an index on a set of columns that are already collectively subject to a UNIQUE or PRIMARY KEY constraint.

Using the schema above, SQLite mentions it would create an index during the life of the query on lnk_feature:

sqlite> EXPLAIN QUERY PLAN SELECT feature_id, spectrum_id FROM spectrum, feature
   ...> WHERE lnk_feature = lnk_spectrum
   ...>     AND rt >= rtMin AND rt <= rtMax
   ...>     AND mz >= mzMin AND mz <= mzMax;
0|0|0|SCAN TABLE spectrum (~20000 rows)
0|1|1|SEARCH TABLE feature USING AUTOMATIC COVERING INDEX (lnk_feature=?) (~7 rows)

And even though I tested with an index on that column, or other columns, it seems that the fastest way to run that query is without any of those indices.

The fastest I ran the query above using python is 20 minutes. This includes the completion of .fetchall(). You mention that you will have 150 times more rows at some point. I'd start looking into postgresql if I were you ;-)... Note that you could split the work in threads and potentially divide the time to complete the query by the number of threads that will be able to run concurrently (i.e. by the number of CPUs available).

In any case, here is the code I used. Can you run it yourself and report back how fast the query ran in your environment. Note that I am using apsw, so if you can't use it you'll need to tweak to use your own sqlite3 module.

#!/usr/bin/python
import apsw, random as rand, time

def populate(cu):
    cu.execute("""
CREATE TABLE feature ( -- 300k
    feature_id INTEGER PRIMARY KEY,
    mzMin DOUBLE, mzMax DOUBLE,
    rtMin DOUBLE, rtMax DOUBLE,
    lnk_feature INT);
CREATE TABLE spectrum ( -- 20k
    spectrum_id INTEGER PRIMARY KEY,
    mz DOUBLE, rt DOUBLE,
    lnk_spectrum INT);""")
    cu.execute("BEGIN")
    for i in range(300000):
        ((mzMin, mzMax), (rtMin, rtMax)) = (get_min_max(), get_min_max())
        cu.execute("INSERT INTO feature VALUES (NULL,%s,%s,%s,%s,%s)" 
                    % (mzMin, mzMax, rtMin, rtMax, get_lnk()))
    for i in range(20000):
        cu.execute("INSERT INTO spectrum VALUES (NULL,%s,%s,%s)"
                    % (get_in_between(), get_in_between(), get_lnk()))
    cu.execute("COMMIT")
    cu.execute("ANALYZE")

def get_lnk():
    return rand.randint(1, 2)

def get_min_max():
    return sorted((rand.normalvariate(0.5, 0.004), 
                   rand.normalvariate(0.5, 0.004)))

def get_in_between():
    return rand.normalvariate(0.5, 0.49)

def select(cu):
    sql = """
    SELECT feature_id, spectrum_id FROM spectrum, feature
    WHERE lnk_feature = lnk_spectrum
        AND rt >= rtMin AND rt <= rtMax
        AND mz >= mzMin AND mz <= mzMax"""
    start = time.time()
    cu.execute(sql)
    print ("%s rows; %.2f seconds" % (len(cu.fetchall()), time.time() - start))

cu = apsw.Connection('foo.db').cursor()
populate(cu)
select(cu)

Output I get:

54626 rows; 1210.96 seconds
  • My output is: 4675 rows; 1463.49 seconds, I will try it out on my real script now. Thanks! – Niek de Klein May 07 '12 at 08:48
  • My script is now down to: query took: 100.947947025 seconds fetchall took: 1485.43059111 seconds Still not as fast as I wanted it, but a lot better! Thanks! Wish I could give more vote-ups. – Niek de Klein May 07 '12 at 12:26
2

Make it better on the SQL side.

In a word: use indexes!

aF.
1

Use between instead of >= and <= for the range comparisons.

self.cursor.execute("SELECT spectrum_id, feature_table_id "+
                        "FROM `spectrum` "+
                        "INNER JOIN `feature` "+
                        "ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
                        "WHERE spectrum.scan_start_time between feature.rtMin " + 
                        "AND feature.rtMax "+
                        "AND spectrum.base_peak_mz between feature.mzMin "+
                        "AND feature.mzMax")   

You can create non-clustered indexes on the spectrum.scan_start_time, feature.rtMin, feature.rtMax, spectrum.base_peak_mz, feature.mzMin and feature.mzMax fields.

Romil Kumar Jain
1

A normal RDBMS would do a hash join between the spectrum and feature tables. If you can force it to do a hash join, the query should fly.

However, can you try it as a single query?

self.cursor.execute("INSERT INTO `spectrum_has_feature` " + 
                    "SELECT spectrum_id, feature_table_id "+
                    "FROM `spectrum` "+
                    "INNER JOIN `feature` "+
                    "ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
                    "WHERE spectrum.scan_start_time >= feature.rtMin "+
                    "AND spectrum.scan_start_time <= feature.rtMax "+
                    "AND spectrum.base_peak_mz >= feature.mzMin "+
                    "AND spectrum.base_peak_mz <= feature.mzMax")  
Florin Ghita
1

I ran Ludo's script and it reported 1451 seconds on my system. I then added the following index, which cut the time to 875 seconds (40% decrease):

CREATE INDEX idx1 ON feature (lnk_feature, mzMin, mzMax, rtMin, rtMax);

Still not blindingly fast, but better. Here's the output of EXPLAIN QUERY PLAN:

0|0|0|SCAN TABLE spectrum (~20000 rows)
0|1|1|SEARCH TABLE feature USING COVERING INDEX idx1 (lnk_feature=? AND mzMin<?) (~7 rows)

Note that this is a covering index, which means that SQLite can read all the fields it needs from the index and therefore never has to read from the feature table. It is also able to use the index for two of the five conditions in the WHERE clause (instead of only one).

sendai
0

Change this

CREATE INDEX `fk_spectrum_msrun1` ON `spectrum` (`msrun_msrun_id` ASC);

to one of these (whichever is more selective)

CREATE INDEX `fk_spectrum_msrun1_1` ON `spectrum` (`msrun_msrun_id`, `scan_start_time`);
CREATE INDEX `fk_spectrum_msrun1_2` ON `spectrum` (`msrun_msrun_id`, `base_peak_mz`);

The first one may be able to speed up the comparisons on scan_start_time and the second one the comparisons on base_peak_mz. Because these are inequality comparisons, an index on both columns is not useful.

Janne Karila
0

1. Use between instead of <= and >=.

2. Add indexes on scan_start_time and base_peak_mz.