92

Is there a way to specify bin sizes in MySQL? Right now, I am trying the following SQL query:

select total, count(total) from faults GROUP BY total;

The data that is being generated is good enough but there are just too many rows. What I need is a way to group the data into predefined bins. I can do this from a scripting language, but is there a way to do it directly in SQL?

Example:

+-------+--------------+
| total | count(total) |
+-------+--------------+
|    30 |            1 | 
|    31 |            2 | 
|    33 |            1 | 
|    34 |            3 | 
|    35 |            2 | 
|    36 |            6 | 
|    37 |            3 | 
|    38 |            2 | 
|    41 |            1 | 
|    42 |            5 | 
|    43 |            1 | 
|    44 |            7 | 
|    45 |            4 | 
|    46 |            3 | 
|    47 |            2 | 
|    49 |            3 | 
|    50 |            2 | 
|    51 |            3 | 
|    52 |            4 | 
|    53 |            2 | 
|    54 |            1 | 
|    55 |            3 | 
|    56 |            4 | 
|    57 |            4 | 
|    58 |            2 | 
|    59 |            2 | 
|    60 |            4 | 
|    61 |            1 | 
|    63 |            2 | 
|    64 |            5 | 
|    65 |            2 | 
|    66 |            3 | 
|    67 |            5 | 
|    68 |            5 | 
------------------------

What I am looking for:

+------------+---------------+
| total      | count(total)  |
+------------+---------------+
|    30 - 40 |            23 | 
|    40 - 50 |            15 | 
|    50 - 60 |            51 | 
|    60 - 70 |            45 | 
------------------------------

I guess this cannot be achieved in a straight forward manner but a reference to any related stored procedure would be fine as well.

huysentruitw
  • 27,376
  • 9
  • 90
  • 133
Legend
  • 113,822
  • 119
  • 272
  • 400

10 Answers10

173

This is a post about a super quick-and-dirty way to create a histogram in MySQL for numeric values.

There are multiple other ways to create histograms that are better and more flexible, using CASE statements and other types of complex logic. This method wins me over time and time again since it's just so easy to modify for each use case, and so short and concise. This is how you do it:

SELECT ROUND(numeric_value, -2)    AS bucket,
       COUNT(*)                    AS COUNT,
       RPAD('', LN(COUNT(*)), '*') AS bar
FROM   my_table
GROUP  BY bucket;

Just change numeric_value to whatever your column is, change the rounding increment, and that's it. I've made the bars to be in logarithmic scale, so that they don't grow too much when you have large values.

numeric_value should be offset in the ROUNDing operation, based on the rounding increment, in order to ensure the first bucket contains as many elements as the following buckets.

e.g. with ROUND(numeric_value,-1), numeric_value in range [0,4] (5 elements) will be placed in first bucket, while [5,14] (10 elements) in second, [15,24] in third, unless numeric_value is offset appropriately via ROUND(numeric_value - 5, -1).

This is an example of such query on some random data that looks pretty sweet. Good enough for a quick evaluation of the data.

+--------+----------+-----------------+
| bucket | count    | bar             |
+--------+----------+-----------------+
|   -500 |        1 |                 |
|   -400 |        2 | *               |
|   -300 |        2 | *               |
|   -200 |        9 | **              |
|   -100 |       52 | ****            |
|      0 |  5310766 | *************** |
|    100 |    20779 | **********      |
|    200 |     1865 | ********        |
|    300 |      527 | ******          |
|    400 |      170 | *****           |
|    500 |       79 | ****            |
|    600 |       63 | ****            |
|    700 |       35 | ****            |
|    800 |       14 | ***             |
|    900 |       15 | ***             |
|   1000 |        6 | **              |
|   1100 |        7 | **              |
|   1200 |        8 | **              |
|   1300 |        5 | **              |
|   1400 |        2 | *               |
|   1500 |        4 | *               |
+--------+----------+-----------------+

Some notes: Ranges that have no match will not appear in the count - you will not have a zero in the count column. Also, I'm using the ROUND function here. You can just as easily replace it with TRUNCATE if you feel it makes more sense to you.

I found it here http://blog.shlomoid.com/2011/08/how-to-quickly-create-histogram-in.html

Cary Reams
  • 195
  • 1
  • 7
Jaro
  • 3,799
  • 5
  • 31
  • 47
  • 1
    As of MySQL 8.0.3, you now have the ability to create histogram statistics in order to provide more statistics to the optimizer - see http://mysqlserverteam.com/histogram-statistics-in-mysql/ – Jaro Oct 18 '17 at 08:12
  • 1
    You don't even need the "bar" part of the query; the numbers themselves already form a logarithmic bar chart / histogram. – enharmonic Dec 18 '19 at 15:05
38

Mike DelGaudio's answer is the way I do it, but with a slight change:

select floor(mycol/10)*10 as bin_floor, count(*)
from mytable
group by 1
order by 1

The advantage? You can make the bins as large or as small as you want. Bins of size 100? floor(mycol/100)*100. Bins of size 5? floor(mycol/5)*5.

Bernardo.

Bernardo Siu
  • 1,359
  • 13
  • 13
  • as carillonator said your group by & order by better should be bin_floor or 1 - Ill upvote if you correct it, this is the best answer for me – B M Jun 03 '14 at 10:48
  • Fair enough, @b-m. Changed as suggested by carillonator. – Bernardo Siu Jun 03 '14 at 14:59
  • 1
    and if you want a nicer column name you can do `concat(floor(mycol/5)*5," to ",floor(mycol/5)*5+5)` – alex9311 Jul 26 '16 at 11:56
  • This is actually better than simple `round(mycol, -2)` from the accepted answer since it lets user define any non-decimal "range". I'd just use `round` instead of `floor` since it properly rounds the numbers. – meridius Jul 16 '20 at 09:21
  • what is group by 1? – luky Apr 24 '21 at 12:30
  • 1
    @luky it means that you group by the first field in the select statement (in this case, bin_floor). As far as I know, only mysql implements this syntax. – Bernardo Siu Apr 26 '21 at 00:27
16
SELECT b.*,count(*) as total FROM bins b 
left outer join table1 a on a.value between b.min_value and b.max_value 
group by b.min_value

The table bins contains columns min_value and max_value which define the bins. note that the operator "join... on x BETWEEN y and z" is inclusive.

table1 is the name of the data table

Ofri Raviv
  • 24,375
  • 3
  • 55
  • 55
  • 2
    Why is the syntax coloring for SQL so bad? How can I improve this? Maybe I should post it on meta ;) – Ofri Raviv Nov 19 '09 at 17:40
  • 2
    In this case is necessary a template table to define min an max. Only with SQL is not possible. – Cesar Nov 19 '09 at 17:45
  • SQL Guru! Exactly what I wanted. I guess care should be taken when creating the bins table. Otherwise everything works perfectly. :) Thank You. I just finished writing a python script but this is just what I needed... – Legend Nov 19 '09 at 17:53
  • @Legend: Actually, I'm quite a n00b when it comes to SQL. but this was a cool and useful question so I liked the exercise... – Ofri Raviv Nov 19 '09 at 18:02
  • Thanks... In addition, I will also post a direct python script that utilizes your sql query to generate the data for say, gnuplot – Legend Nov 19 '09 at 18:26
  • 1
    It's important to see @David West's answer (which should have been a comment, here) about how the COUNT(*) produces 1 when when it should produce zero. That may not be a big problem for you, but it can skew statistical data and make you look a little silly if someone notices :) – Christopher Schultz Mar 17 '15 at 18:00
11

Ofri Raviv's answer is very close but incorrect. The count(*) will be 1 even if there are zero results in a histogram interval. The query needs to be modified to use a conditional sum:

SELECT b.*, SUM(a.value IS NOT NULL) AS total FROM bins b
  LEFT JOIN a ON a.value BETWEEN b.min_value AND b.max_value
GROUP BY b.min_value;
Jacob Gabrielson
  • 34,800
  • 15
  • 46
  • 64
David West
  • 111
  • 1
  • 2
10
select "30-34" as TotalRange,count(total) as Count from table_name
   where total between 30 and 34
union (
select "35-39" as TotalRange,count(total) as Count from table_name 
   where total between 35 and 39)
union (
select "40-44" as TotalRange,count(total) as Count from table_name
   where total between 40 and 44)
union (
select "45-49" as TotalRange,count(total) as Count from table_name
   where total between 45 and 49)
etc ....

As long as there are not too many intervals, this is a pretty good solution.

Gabe Moothart
  • 31,211
  • 14
  • 77
  • 99
sammy
  • 223
  • 5
  • 10
  • 1
    +1 This is the only solution here that allows bins to be of different size – Gabe Moothart Dec 10 '12 at 18:52
  • great - no need for additional tables – NiRR Jul 25 '16 at 08:28
  • +1 This is the most flexible solution imo, and seems to best fit the use case of wanting to bin from within SQL. any case where bin ranges need to be programatically derived, its likely better to do that outside of SQL. again imo – Ryan McCoy Apr 02 '20 at 01:20
4

I made a procedure that can be used to automatically generate a temporary table for bins according to a specified number or size, for later use with Ofri Raviv's solution.

CREATE PROCEDURE makebins(numbins INT, binsize FLOAT) # binsize may be NULL for auto-size
BEGIN
 SELECT FLOOR(MIN(colval)) INTO @binmin FROM yourtable;
 SELECT CEIL(MAX(colval)) INTO @binmax FROM yourtable;
 IF binsize IS NULL 
  THEN SET binsize = CEIL((@binmax-@binmin)/numbins); # CEIL here may prevent the potential creation a very small extra bin due to rounding errors, but no good where floats are needed.
 END IF;
 SET @currlim = @binmin;
 WHILE @currlim + binsize < @binmax DO
  INSERT INTO bins VALUES (@currlim, @currlim+binsize);
  SET @currlim = @currlim + binsize;
 END WHILE;
 INSERT INTO bins VALUES (@currlim, @maxbin);
END;

DROP TABLE IF EXISTS bins; # be careful if you have a bins table of your own.
CREATE TEMPORARY TABLE bins (
minval INT, maxval INT, # or FLOAT, if needed
KEY (minval), KEY (maxval) );# keys could perhaps help if using a lot of bins; normally negligible

CALL makebins(20, NULL);  # Using 20 bins of automatic size here. 

SELECT bins.*, count(*) AS total FROM bins
LEFT JOIN yourtable ON yourtable.value BETWEEN bins.minval AND bins.maxval
GROUP BY bins.minval

This will generate the histogram count only for the bins that are populated. David West ought to be right in his correction, but for some reason, unpopulated bins do not appear in the result for me (despite the use of a LEFT JOIN — I do not understand why).

Dologan
  • 4,554
  • 2
  • 31
  • 33
3
SELECT
    CASE
        WHEN total <= 30 THEN "0-30"
        WHEN total <= 40 THEN "31-40"       
        WHEN total <= 50 THEN "41-50"
        ELSE "50-"
    END as Total,
    count(*) as count
GROUP BY Total 
ORDER BY Total;
Rúnar Berg
  • 4,229
  • 1
  • 22
  • 38
Zebra
  • 139
  • 2
  • 10
3

That should work. Not so elegant but still:

select count(mycol - (mycol mod 10)) as freq, mycol - (mycol mod 10) as label
from mytable
group by mycol - (mycol mod 10)
order by mycol - (mycol mod 10) ASC

via Mike DelGaudio

Renaud
  • 16,073
  • 6
  • 81
  • 79
2

Equal width binning into a given count of bins:

WITH bins AS(
   SELECT min(col) AS min_value
        , ((max(col)-min(col)) / 10.0) + 0.0000001 AS bin_width
   FROM cars
)
SELECT tab.*,
   floor((col-bins.min_value) / bins.bin_width ) AS bin
FROM tab, bins;

Note that the 0.0000001 is there to make sure that the records with the value equal to max(col) do not make it's own bin just by itself. Also, the additive constant is there to make sure the query does not fail on division by zero when all the values in the column are identical.

Also note that the count of bins (10 in the example) should be written with a decimal mark to avoid integer division (the unadjusted bin_width can be decimal).

user824276
  • 617
  • 1
  • 7
  • 20
1

In addition to great answer https://stackoverflow.com/a/10363145/916682, you can use phpmyadmin chart tool for a nice result:

enter image description here

enter image description here

Community
  • 1
  • 1
zub0r
  • 1,299
  • 1
  • 14
  • 20