How to optimize count and order by query in millions of row

Question

Needed help in optimizing order by and count query, I have tables having millions (approx 3 millions) rows.

I have to join 4 tables and fetch the records, When i run the simple query it takes only millisecond to complete but as I try to count or order by having left join table it get stuck for unlimited of time.

Please see the cases below.

DB Server Configuration:

CPU Number of virtual cores: 4
Memory(RAM): 16 GiB
Network Performance: High

Rows in each table:

tbl_customers -  #Rows: 20 million.
tbl_customers_address -  #Row 25 million.
tbl_shop_setting - #Rows 50k
aio_customer_tracking - #Rows 5k

Tables Schema:

CREATE TABLE `tbl_customers` (
    `id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
    `shopify_customer_id` BIGINT(20) UNSIGNED NOT NULL,
    `shop_id` BIGINT(20) UNSIGNED NOT NULL,
    `email` VARCHAR(225) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `accepts_marketing` TINYINT(1) NULL DEFAULT NULL,
    `first_name` VARCHAR(50) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `last_name` VARCHAR(50) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `last_order_id` BIGINT(20) NULL DEFAULT NULL,
    `total_spent` DECIMAL(12,2) NULL DEFAULT NULL,
    `phone` VARCHAR(20) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `verified_email` TINYINT(4) NULL DEFAULT NULL,
    `updated_at` DATETIME NULL DEFAULT NULL,
    `created_at` DATETIME NULL DEFAULT NULL,
    `date_updated` DATETIME NULL DEFAULT NULL,
    `date_created` DATETIME NULL DEFAULT NULL,
    PRIMARY KEY (`id`),
    UNIQUE INDEX `shopify_customer_id_unique` (`shopify_customer_id`),
    INDEX `email` (`email`),
    INDEX `shopify_customer_id` (`shopify_customer_id`),
    INDEX `shop_id` (`shop_id`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB;


CREATE TABLE `tbl_customers_address` (
    `id` BIGINT(20) NOT NULL AUTO_INCREMENT,
    `customer_id` BIGINT(20) NULL DEFAULT NULL,
    `shopify_address_id` BIGINT(20) NULL DEFAULT NULL,
    `shopify_customer_id` BIGINT(20) NULL DEFAULT NULL,
    `first_name` VARCHAR(50) NULL DEFAULT NULL,
    `last_name` VARCHAR(50) NULL DEFAULT NULL,
    `company` VARCHAR(50) NULL DEFAULT NULL,
    `address1` VARCHAR(250) NULL DEFAULT NULL,
    `address2` VARCHAR(250) NULL DEFAULT NULL,
    `city` VARCHAR(50) NULL DEFAULT NULL,
    `province` VARCHAR(50) NULL DEFAULT NULL,
    `country` VARCHAR(50) NULL DEFAULT NULL,
    `zip` VARCHAR(15) NULL DEFAULT NULL,
    `phone` VARCHAR(20) NULL DEFAULT NULL,
    `name` VARCHAR(50) NULL DEFAULT NULL,
    `province_code` VARCHAR(5) NULL DEFAULT NULL,
    `country_code` VARCHAR(5) NULL DEFAULT NULL,
    `country_name` VARCHAR(50) NULL DEFAULT NULL,
    `longitude` VARCHAR(250) NULL DEFAULT NULL,
    `latitude` VARCHAR(250) NULL DEFAULT NULL,
    `default` TINYINT(1) NULL DEFAULT NULL,
    `is_geo_fetched` TINYINT(1) NOT NULL DEFAULT '0',
    PRIMARY KEY (`id`),
    INDEX `customer_id` (`customer_id`),
    INDEX `shopify_address_id` (`shopify_address_id`),
    INDEX `shopify_customer_id` (`shopify_customer_id`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB;

CREATE TABLE `tbl_shop_setting` (
    `id` INT(11) NOT NULL AUTO_INCREMENT,   
    `shop_name` VARCHAR(300) NOT NULL COLLATE 'latin1_swedish_ci',
     PRIMARY KEY (`id`),
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB;


CREATE TABLE `aio_customer_tracking` (
    `id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
    `shopify_customer_id` BIGINT(20) UNSIGNED NOT NULL,
    `email` VARCHAR(255) NULL DEFAULT NULL,
    `shop_id` BIGINT(20) UNSIGNED NOT NULL,
    `domain` VARCHAR(255) NULL DEFAULT NULL,
    `web_session_count` INT(11) NOT NULL,
    `last_seen_date` DATETIME NULL DEFAULT NULL,
    `last_contact_date` DATETIME NULL DEFAULT NULL,
    `last_email_open` DATETIME NULL DEFAULT NULL,
    `created_date` DATETIME NOT NULL,
    `is_geo_fetched` TINYINT(1) NOT NULL DEFAULT '0',
    PRIMARY KEY (`id`),
    INDEX `shopify_customer_id` (`shopify_customer_id`),
    INDEX `email` (`email`),
    INDEX `shopify_customer_id_shop_id` (`shopify_customer_id`, `shop_id`),
    INDEX `last_seen_date` (`last_seen_date`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB;

Query Cases Running and Not Running:

1. Running:  Below query fetch the records by joining all the 4 tables, It takes only 0.300 ms.

SELECT `c`.first_name,`c`.last_name,`c`.email, `t`.`last_seen_date`, `t`.`last_contact_date`, `ssh`.`shop_name`, ca.`company`, ca.`address1`, ca.`address2`, ca.`city`, ca.`province`, ca.`country`, ca.`zip`, ca.`province_code`, ca.`country_code`
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id 
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
LIMIT 20

2. Not running: Simply when try to get the count of these row stuk the query, I waited 10 min but still running.

SELECT 
     COUNT(DISTINCT c.shopify_customer_id)   -- what makes #2 different
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id 
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
LIMIT 20


3. Not running: In the #1 query we simply put the 1 Order by clause and it get stuck, I waited 10 min but still running. I study query optimization some article and tried by indexing, Right Join etc.. but still not working.

SELECT `c`.first_name,`c`.last_name,`c`.email, `t`.`last_seen_date`, `t`.`last_contact_date`, `ssh`.`shop_name`, ca.`company`, ca.`address1`, ca.`address2`, ca.`city`, ca.`province`, ca.`country`, ca.`zip`, ca.`province_code`, ca.`country_code`
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id 
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
  ORDER BY `t`.`last_seen_date`    -- what makes #3 different
LIMIT 20

EXPLAIN QUERY #1:

EXPLAIN QUERY #2:

EXPLAIN QUERY #3:

Any suggestion to optimize the query, table structure are welcome.

WHAT I'M TRYING TO DO:

tbl_customers table contains the customer info, tbl_customer_address table contains the addresses of the customers(one customer may have multiple address), And aio_customer_tracking table contains visiting records of the customer last_seen_date is the visiting date.

Now, simply I want to fetch and count the customers, with their one of the address, and visiting info. Also, I may order by any of the column from these 3 tables, In my example i am ordering by last_seen_date (the default order). Hope this explanation helps to understand what i am trying to do.

Unfortunately, your queries are broken, and it is not clear what you want to get as a result, so we cannot tell you how to optimize (or even fix) them. E.g: `select last_seen_date from table group by id` will give you a random row for `last_seen_date` (see e.g. [here](https://stackoverflow.com/a/38551525/6248528)), and then you want to order by it. In your second query, `count(distinct x) group by x` is redundant (it's always 1, that is the point of `group by`), also anything you `left join` to it has no effect (but again, you might want to query something different). Also, — Solarflare, Jun 08 '18 at 14:52
`order by last_seen_date` will list `null` first; 20 million customers and (at most) 5k customers with a `last_seen_date` will get you 19.995.000 rows with `null` first (so to optimize, just remove the `order by`). If you want to get rows with an actual `last_seen_date`, your query could finish in less than a second (removing the `left join` might already do it), but again, we do not really know what your desired result is, so before you try to optimize it, try to make it work/give the right result. (We can help you with that if you add details/sample data/desired result). — Solarflare, Jun 08 '18 at 14:52
@Solarflare, thanks for comments. Actually, aio_customer_tracking table contains the visiting log of the customers and I want to fetch the customers with their last_seen_date, latest seen customers at the top, that's why I am ordering with the last_seen_date. hope this make sense. — Irfan.gwb, Jun 08 '18 at 18:03
Please highlight the code that is different between these queries. — Rick James, Jun 10 '18 at 22:26
@RickJames, highlighted the difference and added the explain of the each quires, Hope this helps you. — Irfan.gwb, Jun 11 '18 at 04:52
If you remove one or other `LEFT`, do you get the correct resultset? — Rick James, Jun 11 '18 at 07:08
@Irfan.gwb - Can you explain what the query is trying to do? I don't see a straightforward way to optimize the 3 queries; perhaps understanding their intent would lead to a way to rewrite them. — Rick James, Jun 11 '18 at 07:31
@RickJames, It works if remove the LEFT JOIN from the `t` and group by `c.shopify_customer_id` both in query #3. Adding group by `c.shopify_customer_id` not working. — Irfan.gwb, Jun 11 '18 at 08:07
@RickJames, I have also added at bottom what i am trying to do with these queries, it will help to understand. — Irfan.gwb, Jun 11 '18 at 08:26
You want to "fetch and count the customers", yet Q1 and Q3 do not "count" and they `LIMIT 20`. Please clarify. — Rick James, Jun 12 '18 at 01:23
@RickJames, Q1 is just an example of running query, Q2 & Q3 are the cases which is not running I am looking the solution for Q2 & Q3. — Irfan.gwb, Jun 12 '18 at 04:41

Rick James · Answer 1 · 2018-06-12T03:37:09.980

In query #1, but not the other two, the optimizer can use

UNIQUE INDEX `shopify_customer_id_unique` (`shopify_customer_id`)

to cut the query short for

GROUP BY c.shopify_customer_id
LIMIT 20

This is because it can stop after 20 items of the index. The query is not ultra-fast because of the derived table (subquery t) that hits about 51K rows.

Query #2 may be slow simply because the Optimizer failed to notice and remove the redundant DISTINCT. Instead, it may be thinking it can't stop after 20.

Query #3 must go entirely through table c to get every shopify_customer_id group. This is because the ORDER BY prevents a short curcuit to get to the LIMIT 20.

The columns in a GROUP BY must include all the non-aggregate columns in the SELECT except any that are uniquely defined by the group by columns. Since you have said that there can be multiple addresses for a single shopify_customer_id, then fetching ca.address1 is not proper in connection with GROUP BY shopify_customer_id. Similarly, the subquery seems to be improper with respect to last_seen_date, last_contact_date.

In aio_customer_tracking, this change (to a "covering" index) may help a little:

INDEX (`shopify_customer_id`)

to

INDEX (`shopify_customer_id`, `last_seen_date`, `last_contact_date`)

Dissecting the goal

Now, simply I want to ... count the customers

To count the customers, do this, but don't try to combine it with "fetching":

SELECT COUNT(*) FROM tbl_customers;

Now, simply I want to fetch ... the customers...

tbl_customers - #Rows: 20 million.

Surely you don't want to fetch 20 million rows! I don't want to think about how to try to do that. Please clarify. And I won't accept paginating through that many rows. Perhaps there is a WHERE clause?? The WHERE clause is (usually) the most important part of Optimization!

Now, simply I want to fetch ... the customers, with their one of the address, and visiting info.

Assuming that the WHERE filters down to a "few" customers, then JOINing to another table to get "any" address and "any" visiting info, may be problematical and/or inefficient. To require the "first" or "last" instead of "any" won't be any easier, but might be more meaningful.

May I suggest that your UI first find a few customers, then if the user wants, go to another page with all the addresses and all the visits. Or can the visits be in the hundreds or more?

Also, I may order by any of the column from these 3 tables, In my example i am ordering by last_seen_date (the default order).

Let's focus on optimizing the WHERE, then tack last_seen_date on the end of any index.

`query #1`: yes, it will help a little but i am ok with query #1. `query #2`: any idea or changes in query so that optimizer can notice limit? `query #3`: I am ok with any of the row from the address table so no issue with that, I changed the index as suggested but it does not help.. still my query get stuck. — Irfan.gwb, Jun 11 '18 at 07:56

score 4 · Answer 2 · answered Jun 11 '18 at 04:51

4

shopify_customer_id is unique in tbl_customers table, then in 2nd query why you use distinct and group by in shopify_customer_id column?

Please get rid of that.

answered Jun 11 '18 at 04:51

mamun0024

71
5

LEFT JOIN table `tbl_customers_address` may contains multiple records regarding the `shopify_customer_id` and I have to fetch unique customer row. Therefore, added the group by and distinct. – Irfan.gwb Jun 11 '18 at 05:06
@Irfan.gwb - `GROUP BY` has the effect of `DISTINCT`. mamun is suggesting getting rid of `DISTINCT`. – Rick James Jun 11 '18 at 07:06
1

If there are multiple addresses, which one do you want to get?? My point is that the `GROUP BY` is "invalid" in _every_ one of the queries. – Rick James Jun 11 '18 at 07:10
I am OK with the any of the address row, no issue with that. – Irfan.gwb Jun 11 '18 at 08:28

iLikeMySql · Answer 3 · 2018-06-15T20:21:14.420

Query 2 contains a logical mistake as pointed out by others: the count(distinct(c.shopify_customer_id)) will return a single value, therefore your group by is only complicating the query (this might indeed make MySQL grouping by shopify_customer_id first and then executing the count(distinct(shopify_customer_id )) which could be the reason for the somehow long execution time

The order by of Query 3 can not be optimized as you are joining on a subselect which cannot be indexed. The time it takes is simply the time the system needs to order the result set.

The solution to your problem would be to:

change the index shopify_customer_id (shopify_customer_id) of table tbl_customers_address to shopify_customer_id (shopify_customer_id,default) to optimize the following query
create a table with the result from Query 1 (result) but without

LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id.
alter the result table and add a column for last_seen_date and indexes for last_seen_date and shopify_customer_id
create a table for the result of this query (last_Date) :

SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id

Update the result table with the values from table last_Date

Now you can run a query against the result table ordered by last_Date using the index you created.

The whole process should take way less time than executing query 2 or query 3

a lead alcove · Answer 4 · 2018-06-29T05:13:54.460

0

You have too many indexes and that can be a real performance killer when it comes to inserts, updates, and deletes, as well as occasionally for selects depending on optimization settings.

Also, remove the GROUP BY statement.

There's more I could say about correct use of clustered vs. nonclustered indexes, GROUP BY, ORDER BY, WHERE, and views, for query optimization. However, I think if you remove some indexes, your queries will speed up a lot. (Maybe also rework your queries to follow stricter SQL standards and be a bit more logical, but that's outside the scope of this question.)

One more thing - what are you doing with the query results? Is this being stored somewhere and accessed for lookups, used for calculations, used for automated reports, displaying through web database connection, etc? This makes a difference because if you just need a report/backup or export to a flat file, then there are way more efficient ways to get this data out. Lots of different options depending on what you're doing.

edited Jun 29 '18 at 05:13

answered Jun 13 '18 at 13:08

a lead alcove

305
1
14

Removing indexes slightly helps `INSERTs`, but can be a performance killer for `SELECTs`. I cannot think of a case in MySQL where "too many indexes" would be a "performance killer". `SELECTs` will not speed up at all. – Rick James Jun 18 '18 at 19:07
Totally. I should have been more specific about the effects on performance on queries vs inserts. Honestly I just skimmed her/his question after seeing a few things I didn't like with a few joins and the group by. Answers mostly just addressed the R of the CRUD and I thought it was worth addressing. There are certainly cases where too many indexes can slow down queries, at least in SQL Server, but they are exceptions. I should have spent more time reading their full question and providing more detail in my answer. However inserts can definitely be affected more than "slightly." – a lead alcove Jun 20 '18 at 18:02
1

Consider editing your Answer to reflect what you just said. – Rick James Jun 20 '18 at 19:06
Thanks, just did. – a lead alcove Jun 29 '18 at 05:23