4

We have the following mysql tables (simplified for going straight to the point)

CREATE TABLE `MONTH_RAW_EVENTS` (
  `idEvent` int(11) unsigned NOT NULL,
  `city` varchar(45) NOT NULL,
  `country` varchar(45) NOT NULL,
  `ts` datetime NOT NULL,
  `idClient` varchar(45) NOT NULL,
  `event_category` varchar(45) NOT NULL,
  ... bunch of other fields
  PRIMARY KEY (`idEvent`),
  KEY `idx_city` (`city`),
  KEY `idx_country` (`country`),
  KEY `idClient` (`idClient`),
) ENGINE=InnoDB;

CREATE TABLE `compilation_table` (
  `idClient` int(11) unsigned DEFAULT NULL,
  `city` varchar(200) DEFAULT NULL,
  `month` int(2) DEFAULT NULL,
  `year` int(4) DEFAULT NULL,
  `events_profile` int(10) unsigned NOT NULL DEFAULT '0',
  `events_others` int(10) unsigned NOT NULL DEFAULT '0',
  `events_total` int(10) unsigned NOT NULL DEFAULT '0',
  KEY `idx_month` (`month`),
  KEY `idx_year` (`year`),
  KEY `idx_idClient` (`idClient`),
  KEY `idx_city` (`city`)
) ENGINE=InnoDB;

MONTH_RAW_EVENTS contains almost 20M rows having user performed actions in a website, it sizes almost 4GB

compilation_table has a summary clients/cities per each month, we use it for displaying stats on a website in real time

We process the statistics (from first table to second one) once per month, and we're trying to optimize a query that performs such operation (as until now we're processing everything in PHP which takes loong loong time)

Here's the query we came up with, which seems doing the job when using small subsets of data, the problem that takes more than 6hours to process for the full set of data

INSERT INTO compilation_table (idClient,city,month,year,events_profile,events_others)


    SELECT  IFNULL(OTHERS.idClient,AP.idClient) as idClient,
            IF(IFNULL(OTHERS.city,AP.city)='','Others',IFNULL(OTHERS.city,AP.city)) as city,
        01,2014,
    IFNULL(AP.cnt,0) as events_profile,
        IFNULL(OTHERS.cnt,0) as events_others           

    FROM
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`!='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as OTHERS
 LEFT JOIN 
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as CLIPROFILE 
    ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient

 UNION

    SELECT  IFNULL(OTHERS.idClient,CLIPROFILE.idClient) as idClient,
            IF(IFNULL(OTHERS.city,CLIPROFILE.city)='','Others',IFNULL(OTHERS.city,CLIPROFILE.city)) as city,
            01,2014,
            IFNULL(CLIPROFILE.cnt,0) as events_profile,
            IFNULL(OTHERS.cnt,0) as events_others           
    FROM
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`!='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as OTHERS
 RIGHT JOIN 
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as CLIPROFILE 
    ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient

What we're trying to do is a FULL Outer Join in Mysql so the basic schema of the query is like: the one proposed here

How can we optimize the query? we've been trying diferent indexes, swiching things around but after 8 hours still didnt finished running,

The MySQL server is a Percona MySQL 5.5 dedicated machine with 2cpu, 2GB ram, and SSD disk, we optimized the configuration of such server using Percona tools,

Any help would be really appreciated,

thanks

Community
  • 1
  • 1
jmserra
  • 1,296
  • 4
  • 18
  • 34

1 Answers1

6

You're doing a UNION which results in DISTINCT processing.

It's usually better to rewrite a Full Join to a Left Join plus the non-matching rows of a Right Join (if it's proper 1:n join)

OTHERS LEFT JOIN CLIPROFILE 
ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient
union all
OTHERS RIGHT JOIN CLIPROFILE 
ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient
WHERE OTHERS.idClient IS NULL 

Additionally you might materialize the results of the Derived Tables in temp tables before joining them, thus the calculation is only done once (I don't know if MySQL's optimizer is smart enough to do that automatically).

Plus it might be more efficient to group by and join on city/country as separate columns and do the CONCAT(city,', ',country) as city in the outer step.

dnoeth
  • 59,503
  • 4
  • 39
  • 56
  • That was awesome buddy, now we've been able to optimize the whole process down to 10 minutes instead of 10 hours :) i think creating those temp tables helped too, tried first to do MEMORY tables but they were too big to fit in our 2GB server, so ended up using MyISAM and now works like a charm! – jmserra Mar 10 '14 at 11:19
  • I know this post is kind of old but thanks @dnoeth for the tip: amazing improvement in perfomances – Luca Nov 16 '21 at 15:53