
Suppose I have two tables as follows (data taken from this SO post):

Table d1:

 x start end
 a     1   3
 b     5  11
 c    19  22
 d    30  39
 e     7  25

Table d2:

 x pos
 a   2
 a   3
 b   3
 b  12
 c  20
 d  52
 e  10

The first row in both tables is the column header. I'd like to extract all the rows in d2 where column x matches d1 and pos falls within (boundary values included) d1's start and end columns. That is, I'd like the result:

 x pos start  end
 a   2     1    3
 a   3     1    3
 c  20    19   22
 e  10     7   25

The way I've seen this done so far is:

SELECT * FROM d1 JOIN d2 USING (x) WHERE pos BETWEEN start AND end

But what is not clear to me is whether this operation is performed as efficiently as it can be (i.e., optimised internally). For example, computing the entire join first and only then filtering is not really a scalable approach IMHO (in terms of both speed and memory).

Are there any other efficient query optimisations (e.g. using interval trees) or other algorithms that can handle ranges efficiently (again, in terms of both speed and memory) in SQL that I can make use of? It doesn't matter whether it's SQLite, PostgreSQL, MySQL, etc.

What is the most efficient way to perform this operation in SQL?

Thank you very much.

  • What happens if you use ON in place of USING and include the BETWEEN in the ON clause, dropping the WHERE altogether, i.e. `SELECT * FROM d1 JOIN d2 ON d1.x = d2.x AND d2.pos BETWEEN d1.start AND d1.end`? On a side note, it is always a good idea to specify which table a column belongs to, to avoid problems later when a column with the same name is added to another table. – cha Dec 11 '14 at 22:58
  • @cha, the one in the Q turned out to be faster (27s) vs yours (30s) on slightly bigger data. What is the difference between the two (internally)? Is there an online resource you could point me to? I'm more curious to know whether there are algorithms designed for dealing with intervals implemented in SQL. And thanks much for the tip. – Arun Dec 12 '14 at 08:09
  • A single interval lookup typically can be optimized with a 'normal' index, if it is on the last column in the index. (Here: `CREATE INDEX whatever ON d2(x,pos)`) – CL. Dec 12 '14 at 09:28
  • @CL, not sure I follow exactly, but I created the index as you mentioned and ran the query again. Runtime is identical (27s). Here are the links to [d1.csv](https://dl.dropboxusercontent.com/u/3985883/Arun/d1.csv) and [d2.csv](https://dl.dropboxusercontent.com/u/3985883/Arun/d2.csv) if you'd like to test. I am an R user and co-developer of the data.table R package. We recently implemented overlap joins using binary search, and it takes 0.7 seconds. Wondering if such a speedup is possible for this case. – Arun Dec 12 '14 at 09:58
  • According to the EXPLAIN QUERY PLAN output, SQLite is as efficient as possible with this index. I don't know anything about the specifics of your measurements. – CL. Dec 12 '14 at 12:12
  • @CL., thanks. Like I said before, I'd like to know how it's done internally, and if it's possible to use interval trees (or other algorithms) designed for operating on interval data efficiently. I just found [this post](http://stackoverflow.com/q/10907910/559784) where the OP seems to have resorted to interval trees, for example, but it's not clear if it's in SQL... – Arun Dec 12 '14 at 12:24
  • You might want to test Postgres' range types with this. They are pretty efficient and they might be faster than such a `between` query. –  Jan 13 '15 at 23:07
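For anyone who wants to try the range-type suggestion from the last comment, here is a minimal sketch of what it could look like in PostgreSQL (index and alias names are made up for illustration; `end` has to be double-quoted because it is a reserved word):

    -- Expression index over an inclusive range; GiST supports the containment
    -- operator @> used in the join condition below.
    CREATE INDEX d1_range_idx ON d1 USING gist (int4range(start, "end", '[]'));

    SELECT d2.x, d2.pos, d1.start, d1."end"
    FROM d2
    JOIN d1 ON d1.x = d2.x
           AND int4range(d1.start, d1."end", '[]') @> d2.pos;

    -- Folding the x equality into the same index would need a multicolumn GiST
    -- index with the btree_gist extension.

Whether this beats the plain BETWEEN query depends on the data; it is mainly worth trying when the intervals are large or heavily overlapping.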

1 Answer


Not sure how it all works out internally, but depending on the situation I would advise playing around with a table that 'rolls out' all the values from d1 and then joining on that one. This way the query engine can pinpoint the right records 'exactly' instead of having to find a combination of boundaries that matches the value being looked for.

e.g.

 x value
 a     1
 a     2
 a     3
 b     5
 b     6
 b     7
 b     8
 b     9
 b    10
 b    11
 c    19
 etc.

Given an index on the value column (**), this should be quite a bit faster than joining on the original d1 table with BETWEEN start AND end, IMHO.

Of course, each time you make changes to d1, you'll need to adjust the rolled-out table too (trigger?). If this happens frequently, you'll spend more time updating the rolled-out table than you gained in the first place! Additionally, this can take up quite a bit of (disk) space quickly if some of the intervals are really big; and it also assumes we never need to look up non-whole numbers (e.g. what if we look for the value 3.14?).

(** You might consider experimenting with a unique index on (value, x) here...)
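A rough sketch of how such a rolled-out table could be built and queried, assuming PostgreSQL's generate_series (table, index and alias names are made up; a recursive CTE would do the same job in SQLite or MySQL 8+):

    -- Expand every [start, end] interval into one row per integer value
    -- ("end" is double-quoted because it is a reserved word in PostgreSQL).
    CREATE TABLE d1_rolled AS
    SELECT d1.x, gs.value, d1.start, d1."end"
    FROM d1, generate_series(d1.start, d1."end") AS gs(value);

    -- UNIQUE only works if each x has a single (or non-overlapping) interval.
    CREATE UNIQUE INDEX d1_rolled_idx ON d1_rolled (value, x);

    -- The range lookup now degenerates into an ordinary equi-join.
    SELECT d2.x, d2.pos, r.start, r."end"
    FROM d2
    JOIN d1_rolled r ON r.x = d2.x AND r.value = d2.pos;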

  • How would I go about doing this in SQL? – Daniel Zhang Mar 23 '22 at 21:28
  • @DanielZhang Depends a lot on the environment (MSSQL, Postgres, etc.). You'd have to create a trigger on `d1` that updates the `roll-out` table whenever there is a change. Re-reading my answer from 7 years ago, I'm a bit sceptical about my own solution tbh =) Assuming `d1` is fairly static and you'd need to join it very often to (a more volatile) `d2` it might be worth it, but the cost of preparing it and the disk space required are (potentially) going to be very high indeed. – deroby Mar 25 '22 at 08:31
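For the trigger route mentioned in the comment above, a hypothetical PostgreSQL version could look roughly like this (function and trigger names are made up; it assumes the d1_rolled table from the sketch in the answer, and other engines have their own trigger syntax):

    -- Keeps d1_rolled in sync with d1; illustrative only, not battle-tested.
    CREATE OR REPLACE FUNCTION d1_rolled_sync() RETURNS trigger AS $$
    BEGIN
        -- Drop the rows belonging to the old interval (UPDATE/DELETE).
        IF TG_OP IN ('UPDATE', 'DELETE') THEN
            DELETE FROM d1_rolled WHERE x = OLD.x;
        END IF;
        -- (Re-)insert the rows for the new interval (INSERT/UPDATE).
        IF TG_OP IN ('INSERT', 'UPDATE') THEN
            INSERT INTO d1_rolled (x, value, start, "end")
            SELECT NEW.x, gs.value, NEW.start, NEW."end"
            FROM generate_series(NEW.start, NEW."end") AS gs(value);
        END IF;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    -- EXECUTE FUNCTION needs PostgreSQL 11+; older versions use EXECUTE PROCEDURE.
    CREATE TRIGGER d1_rolled_sync_trg
    AFTER INSERT OR UPDATE OR DELETE ON d1
    FOR EACH ROW EXECUTE FUNCTION d1_rolled_sync();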