Group by ranges and consider empty ranges
This is an extension to: https://stackoverflow.com/a/41275222/895245 which also creates bins for the empty ranges with 0 entries in them:
select x, sum(cnt) from (
select floor(x/5)*5 as x,
count(*) as cnt
from t
group by 1
union
select *, 0 as cnt from generate_series(0, 15, 5)
)
group by x
Tested with:
create table t(x integer)
insert into t values (
0,
2,
2,
3,
5,
6,
6,
8,
9,
17,
)
Output:
0|4
5|5
10|0
15|1
The trick is to create a range of zeroes with generate_series
, and then sum
it up with the populated ranges. It doesn't alter the counts for populated ranges, but produces the 0 entry.
Multiple range queries are generally faster than the floor()
technique
Although using floor()
is convenient and self contained, it is likely going to be slower on most (?) implementations.
For example, I create a test database with 10m rows:
f="10m.sqlite"
rm -f "$f"
sqlite3 "$f" 'create table t(x integer)'
time sqlite3 "$f" 'insert into t select value as x from generate_series(0,9999999)'
time sqlite3 "$f" 'create index tx on t(x)'
Then, multiple queries with bins of size 1m:
i=0
while [ $i -lt 10 ]; do
sqlite3 10m.sqlite "select count(*) from t where x >= $i and x < $((i + 1000000))"
i=$((i + 1))
done
finishes in 0.45s. Doing floor however:
sqlite3 10m.sqlite <<EOF
select floor(x/1000000)*1000000 as x,
count(*) as cnt
from t
group by 1
order by 1
EOF
takes 0.37 seconds.
My theory is that this is simply because of the overhead of the floor, division and multiplication, which the ranged version skips, and just goes through the B-tree index counting.
Tested on Ubuntu 23.04 SQLite 3.40.1, Lenovo ThinkPad P51, SSD: Samsung MZVLB512HAJQ-000L7 512GB SSD, 3 GB/s nominal speed.