1

I was wondering if anyone had some advice for me regarding a histogram-generating query. I have a query that I like (in that it works), but it is extremely slow. Here is the background:

I have a table of metadata, a table of data values where one row in meta_data is a key-row for many (perhaps several thousand) rows in data_values, and a table of histogram bin information:

create table meta_data (
    id int not null primary key, 
    name varchar(100), 
    other_data char(10)
);
create table data_values (
    id int not null primary key, 
    meta_data_id int not null, 
    data_value real
);
create table histogram_bins (
    id int not null primary key,
    bin_min real,
    bin_max real,
    bin_center real,
    bin_size real
);

And a query that creates the histogram:

SELECT md.name AS `Name`, 
       md.other_data AS `OtherData`, 
       hist.bin_center AS `Bin`, 
       SUM(data.data_value BETWEEN hist.bin_min AND hist.bin_max) AS `Frequency`
FROM   histogram_bins hist 
       LEFT JOIN data_values data ON 1 = 1 
       LEFT JOIN meta_data md ON md.id = data.meta_data_id

GROUP BY md.id, `Bin`;

In an earlier version of this query, the BETWEEN ... AND logical statement was down in the JOIN (replacing 1 = 1), but then I would only receive histogram rows with non-zero frequency. I need rows for all of the bins (even the zero-frequency ones), for analysis purposes.

Its pretty darn slow, to the tune of 10-15 minutes or so. The data_values table has about 7.9 million rows, and meta_data weighs in at 15,900 rows -- so maybe it is just going to take a long time!

Thanks very much!

wtt
  • 13
  • 1
  • 5
  • I think what you tried to do is `LEFT OUTER JOIN` (with the original condition) while the output of the `LEFT JOIN` with `1=1` is a cartesian product of the tables `histogram_bins` and `data_values`. This might also explain the slowness of the execution. – Nir Alfasi Aug 18 '12 at 01:31
  • Possible duplicate of [MySQL: Getting data for histogram plot?](http://stackoverflow.com/questions/1764881/mysql-getting-data-for-histogram-plot) – Ciro Santilli OurBigBook.com Oct 07 '15 at 10:39

1 Answers1

0

I think this might help

SELECT h.bin_center AS `Bin`, 
   ISNULL(F.Frequency,0) AS `Frequency`
FROM  histogram_bins h
LEFT JOIN 
   (SELECT hist.bin_center AS `Bin`, 
           COUNT(data_values) AS `Frequency`
    FROM data_values  data
    LEFT JOIN histogram_bins hist ON data.data_value BETWEEN hist.bin_min AND hist.bin_max        
    GROUP BY md.name, md.other_data, hist.bin_center) F ON F.bin_center = h.bin_center

I changed the order of the tables because I think it's best to find the corresponding bin for every record in the data and then just count how many there are grouped by bin

saul672
  • 737
  • 5
  • 6
  • Hi, thanks a lot for the suggestion. I think that would work well if there was only one data set (one histogram). In the live version, the output contains histograms for many data sets, so there needs to be zero-bin rows for each data set -- Maybe that is why the cartesian product (as mentioned above) works, but is soooo slow. – wtt Aug 18 '12 at 11:11