3

I am going to join two tables by using a single position in one table to the range (represented by two columns) in another table.

However, the performance is too slow, which is about 20 mins. I have tried adding the index on the table or changing the query. But the performance is still poor.

So, I am asking for optimization of the joining speed.


The following is the query to MySQL.

mysql> SELECT `inVar`.chrom, `inVar`.pos, `openChrom_K562`.score
    -> FROM `inVar`
    -> LEFT JOIN `openChrom_K562`
    -> ON (
    -> `inVar`.chrom=`openChrom_K562`.chrom AND
    -> `inVar`.pos BETWEEN `openChrom_K562`.chromStart AND `openChrom_K562`.chromEnd
    -> );

inVar and openChrom_K562 are the tables I used.

inVar stores the single position in each row.

openChrom_K562 stores the range information indicated by chromStart and chromEnd.

inVar contains 57902 rows and openChrom_K562 has 137373 rows respectively.


Fields on the tables.

mysql> DESCRIBE inVar;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| chrom | varchar(31) | NO   | PRI | NULL    |       |
| pos   | int(10)     | NO   | PRI | NULL    |       |
+-------+-------------+------+-----+---------+-------+

mysql> DESCRIBE openChrom_K562;
+------------+-------------+------+-----+---------+-------+
| Field      | Type        | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+-------+
| chrom      | varchar(31) | NO   | MUL | NULL    |       |
| chromStart | int(10)     | NO   | MUL | NULL    |       |
| chromEnd   | int(10)     | NO   |     | NULL    |       |
| score      | int(10)     | NO   |     | NULL    |       |
+------------+-------------+------+-----+---------+-------+

Index built in the tables

mysql> SHOW INDEX FROM inVar;
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| inVar |          0 | PRIMARY  |            1 | chrom       | A         |        NULL |     NULL | NULL   |      | BTREE      |         |
| inVar |          0 | PRIMARY  |            2 | pos         | A         |       57902 |     NULL | NULL   |      | BTREE      |         |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+

mysql> SHOW INDEX FROM openChrom_K562;
+----------------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table          | Non_unique | Key_name    | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| openChrom_K562 |          1 | start_end   |            1 | chromStart  | A         |      137373 |     NULL | NULL   |      | BTREE      |         |
| openChrom_K562 |          1 | start_end   |            2 | chromEnd    | A         |      137373 |     NULL | NULL   |      | BTREE      |         |
| openChrom_K562 |          1 | chrom_only  |            1 | chrom       | A         |          22 |     NULL | NULL   |      | BTREE      |         |
| openChrom_K562 |          1 | chrom_start |            1 | chrom       | A         |          22 |     NULL | NULL   |      | BTREE      |         |
| openChrom_K562 |          1 | chrom_start |            2 | chromStart  | A         |      137373 |     NULL | NULL   |      | BTREE      |         |
| openChrom_K562 |          1 | chrom_end   |            1 | chrom       | A         |          22 |     NULL | NULL   |      | BTREE      |         |
| openChrom_K562 |          1 | chrom_end   |            2 | chromEnd    | A         |      137373 |     NULL | NULL   |      | BTREE      |         |
+----------------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+

Execution plan on MySQL

mysql> EXPLAIN SELECT `inVar`.chrom, `inVar`.pos, score  FROM `inVar`  LEFT JOIN `openChrom_K562`  ON ( inVar.chrom=openChrom_K562.chrom AND  `inVar`.pos BETWEEN chromStart AND chromEnd );
+----+-------------+----------------+-------+--------------------------------------------+------------+---------+-----------------+-------+-------------+
| id | select_type | table          | type  | possible_keys                              | key        | key_len | ref             | rows  | Extra       |
+----+-------------+----------------+-------+--------------------------------------------+------------+---------+-----------------+-------+-------------+
|  1 | SIMPLE      | inVar          | index | NULL                                       | PRIMARY    | 37      | NULL            | 57902 | Using index |
|  1 | SIMPLE      | openChrom_K562 | ref   | start_end,chrom_only,chrom_start,chrom_end | chrom_only | 33      | tmp.inVar.chrom |  5973 |             |
+----+-------------+----------------+-------+--------------------------------------------+------------+---------+-----------------+-------+-------------+

It seems it only optimizes by looking chrom in two tables. Then do the brute-force comparing in the tables.

Is there any ways to do the further optimization like indexing on the position?

(It is my first time posting the question, sorry for the poor posting quality.)

Eric Ho
  • 149
  • 10
  • Does the query plan change at all if you put the BETWEEN section in a WHERE clause instead of having it be part of the ON clause? (I don't expect this will improve things but it seems worth checking out.) – Ilion Jun 06 '12 at 03:50
  • @Ilion, I have tried, there is no different... – Eric Ho Mar 11 '13 at 17:23

1 Answers1

0

chrom_only is likely to be a bad index selection for your join as you only have chrom 22 values.

If I have interpreted this right the query should be faster if using start_end

SELECT `inVar`.chrom, `inVar`.pos, `openChrom_K562`.score
FROM `inVar`
LEFT JOIN `openChrom_K562`
USE INDEX (`start_end`)
ON (
`inVar`.chrom=`openChrom_K562`.chrom AND
`inVar`.pos BETWEEN `openChrom_K562`.chromStart AND `openChrom_K562`.chromEnd
)
KCD
  • 9,873
  • 5
  • 66
  • 75
  • Thanks your reply. Since the table join in range is really really poor... I have already switch to another approach by building interval tree :P – Eric Ho Mar 11 '13 at 17:18
  • @EricHo, using interval trees seems great, but is there a way to use them (with)in SQL? I've more or less a [similar question here](http://stackoverflow.com/q/27433474/559784). – Arun Dec 12 '14 at 13:26