I asked two related questions ( How can I speed up fetching the results after running an sqlite query? and Is it normal that sqlite.fetchall() is so slow?). I've changed some things around and got some speed-up, but it still takes over an hour for the select statement to finish.
I've got a table feature
which contains an rtMin
, rtMax
, mzMin
and mzMax
value. These values together are the corners of a rectangle (if you read my old questions, I save these values separatly instead of getting the min() and max() from the convexhull
table, works faster).
And I got a table spectrum
with an rt
and an mz
value. I have a table which links features to spectra when the rt
and mz
value of the spectrum is in the rectangle of the feature.
To do this I use the following sql and python code to retrieve the ids of the spectrum and feature:
self.cursor.execute("SELECT spectrum_id, feature_table_id "+
"FROM `spectrum` "+
"INNER JOIN `feature` "+
"ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
"WHERE spectrum.scan_start_time >= feature.rtMin "+
"AND spectrum.scan_start_time <= feature.rtMax "+
"AND spectrum.base_peak_mz >= feature.mzMin "+
"AND spectrum.base_peak_mz <= feature.mzMax")
spectrumAndFeature_ids = self.cursor.fetchall()
for spectrumAndFeature_id in spectrumAndFeature_ids:
spectrum_has_feature_inputValues = (spectrumAndFeature_id[0], spectrumAndFeature_id[1])
self.cursor.execute("INSERT INTO `spectrum_has_feature` VALUES (?,?)",spectrum_has_feature_inputValues)
I timed the execute, fetchall and insert times and got the following:
query took: 74.7989799976 seconds
5888.845541 seconds since fetchall
returned a length of: 10822
inserting all values took: 3.29669690132 seconds
So this query takes about one and a half hours, most of that time doing the fetchall(). How can I speed this up? Should I do the rt
and mz
comparison in python code?
Update:
To show what indexes I got, here are the create statements for the tables:
CREATE TABLE IF NOT EXISTS `feature` (
`feature_table_id` INT PRIMARY KEY NOT NULL ,
`feature_id` VARCHAR(40) NOT NULL ,
`intensity` DOUBLE NOT NULL ,
`overallquality` DOUBLE NOT NULL ,
`charge` INT NOT NULL ,
`content` VARCHAR(45) NOT NULL ,
`intensity_cutoff` DOUBLE NOT NULL,
`mzMin` DOUBLE NULL ,
`mzMax` DOUBLE NULL ,
`rtMin` DOUBLE NULL ,
`rtMax` DOUBLE NULL ,
`msrun_msrun_id` INT NOT NULL ,
CONSTRAINT `fk_feature_msrun1`
FOREIGN KEY (`msrun_msrun_id` )
REFERENCES `msrun` (`msrun_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION);
CREATE UNIQUE INDEX `id_UNIQUE` ON `feature` (`feature_table_id` ASC);
CREATE INDEX `fk_feature_msrun1` ON `feature` (`msrun_msrun_id` ASC);
CREATE TABLE IF NOT EXISTS `spectrum` (
`spectrum_id` INT PRIMARY KEY NOT NULL ,
`spectrum_index` INT NOT NULL ,
`ms_level` INT NOT NULL ,
`base_peak_mz` DOUBLE NOT NULL ,
`base_peak_intensity` DOUBLE NOT NULL ,
`total_ion_current` DOUBLE NOT NULL ,
`lowest_observes_mz` DOUBLE NOT NULL ,
`highest_observed_mz` DOUBLE NOT NULL ,
`scan_start_time` DOUBLE NOT NULL ,
`ion_injection_time` DOUBLE,
`binary_data_mz` BLOB NOT NULL,
`binaray_data_rt` BLOB NOT NULL,
`msrun_msrun_id` INT NOT NULL ,
CONSTRAINT `fk_spectrum_msrun1`
FOREIGN KEY (`msrun_msrun_id` )
REFERENCES `msrun` (`msrun_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION);
CREATE INDEX `fk_spectrum_msrun1` ON `spectrum` (`msrun_msrun_id` ASC);
CREATE TABLE IF NOT EXISTS `spectrum_has_feature` (
`spectrum_spectrum_id` INT NOT NULL ,
`feature_feature_table_id` INT NOT NULL ,
CONSTRAINT `fk_spectrum_has_feature_spectrum1`
FOREIGN KEY (`spectrum_spectrum_id` )
REFERENCES `spectrum` (`spectrum_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION,
CONSTRAINT `fk_spectrum_has_feature_feature1`
FOREIGN KEY (`feature_feature_table_id` )
REFERENCES `feature` (`feature_table_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION);
CREATE INDEX `fk_spectrum_has_feature_feature1` ON `spectrum_has_feature` (`feature_feature_table_id` ASC);
CREATE INDEX `fk_spectrum_has_feature_spectrum1` ON `spectrum_has_feature` (`spectrum_spectrum_id` ASC);
update 2:
I have 20938 spectra, 305742 features and 2 msruns. It results in 10822 matches.
update 3:
Using a new index (CREATE INDEX fk_spectrum_msrun1_2
ON spectrum
(msrun_msrun_id
, base_peak_mz
);) and between saves about 20 seconds:
query took: 76.4599349499 seconds
5864.15418601 seconds since fetchall
Update 4:
Print from EXPLAIN QUERY PLAN:
(0, 0, 0, u'SCAN TABLE spectrum (~1000000 rows)'), (0, 1, 1, u'SEARCH TABLE feature USING INDEX fk_feature_msrun1 (msrun_msrun_id=?) (~2 rows)')