3

The object of my query is to get all rows from table a where gender = f and username does not exist in table b where campid = xxxx. Here is the query I am using with success:

SELECT `id` 
FROM pool 
  LEFT JOIN sent 
    ON  pool.username = sent.username 
    AND sent.campid = 'YA1LGfh9' 
WHERE sent.username IS NULL 
  AND pool.gender = 'f'

The problem is that the query takes over 9 minutes to complete, the pool table contains over 10 million rows and the sent table is eventually going to grow even larger than that. I have created indexes for many of the columns including username and gender. However, MySQL refuses to use any of my indexes for this query. I even tried using FORCE INDEX. Here are my indexes from pool and the output of EXPLAIN for my query:

+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| pool  |          0 | PRIMARY  |            1 | id          | A         |     9326880 |     NULL | NULL   |      | BTREE      |         |
| pool  |          1 | username |            1 | username    | A         |     9326880 |     NULL | NULL   |      | BTREE      |         |
| pool  |          1 | source   |            1 | source      | A         |           6 |     NULL | NULL   |      | BTREE      |         |
| pool  |          1 | gender   |            1 | gender      | A         |           9 |     NULL | NULL   |      | BTREE      |         |
| pool  |          1 | location |            1 | location    | A         |       59030 |     NULL | NULL   |      | BTREE      |         |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
6 rows in set (0.00 sec)

mysql> explain SELECT `id` FROM pool FORCE INDEX (username) LEFT JOIN sent ON pool.username = sent.username AND sent.campid = 'YA1LGfh9' WHERE sent.username IS NULL AND pool.gender = 'f';
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra                   |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------------------+
|  1 | SIMPLE      | pool  | ALL  | NULL          | NULL | NULL    | NULL | 9326881 | Using where             |
|  1 | SIMPLE      | sent  | ALL  | NULL          | NULL | NULL    | NULL |     351 | Using where; Not exists |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------------------+
2 rows in set (0.00 sec)

also, here are my indexes for the sent table:

+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| sent  |          0 | PRIMARY  |            1 | primary_key | A         |         351 |     NULL | NULL   |      | BTREE      |         |
| sent  |          1 | username |            1 | username    | A         |         351 |     NULL | NULL   |      | BTREE      |         |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
2 rows in set (0.00 sec)

You can see that no indexes are not being used and so my query takes extremely too long. If anyone has a solution that involves reworking the query, please show me an example of how to do it using my data structure so that I won't have any confusion of how to implement and test. Thank you.

ypercubeᵀᴹ
  • 113,259
  • 19
  • 174
  • 235
xendi
  • 2,332
  • 5
  • 40
  • 64

1 Answers1

4

First, your original query was correct in your placement of everything... including the camp. By using a LEFT JOIN from Pool to Sent, and then pulling a required equality such as "CAMP" into the WHERE clause as previously suggested is ultimately converting that into an INNER JOIN, thus requiring entry on both sides. Leave it as you had it.

You already have an index on user name on the sent table, but I would do the following.

build an index on the "sent" table on (CampID, UserName) as a composite (ie: multiple key) index. This way the left join will be optimized for BOTH entries.

On your "pool" table, try a composite index on 3 fields of (gender, username, id ).

By doing this, you can take advantage of NOT having to go through all the actual pages of data that encompass your 10+ million records. Since the index HAS the columns for compare, it doesn't have to find the actual record and look at the columns, it can use those of the index directly.

Also, for grins, I added keyword "STRAIGHT_JOIN" which tells MySQL to query exactly as I show and don't try to think for me. MANY times, I've found this to significantly improve query performance... On very few have I been given feedback that it has NOT helped.

SELECT STRAIGHT_JOIN
      p.id
   FROM 
      pool p
         LEFT JOIN sent s
            ON s.campid = 'YA1LGfh9' 
            AND p.username = s.username 
   WHERE 
          p.gender = 'f'
      AND s.username IS NULL 

All that said, you are still going to be returning how many records out of the 10+ million... if the pool has 10+ million, and the single camp only has 5,000. You will still be returning almost the entire set.

DRapp
  • 47,638
  • 12
  • 72
  • 142
  • 1
    I would prefer `(gender, username, id)` – ypercubeᵀᴹ Jun 11 '12 at 11:17
  • @ypercube, good point... by keeping the user name in second position will keep that index from bouncing around to the sent table which would be in proper sequence too. I'll change it. – DRapp Jun 11 '12 at 11:41
  • Okay. I have setup everything to your specification (i think) but I am still having the performance issue. In fact, it is now taking longer than it was with my initial query using no indexes. Here is what I have done: http://pastebin.com/BhyPPVqa the query took almost 13 minutes to complete. Maybe I have done something wrong? – xendi Jun 11 '12 at 12:48
  • @xendi, this may be an exception to the "STRAIGHT_JOIN", try removing that clause, but otherwise SHOULD work. How many records are you expecting back out of the 10+ million. – DRapp Jun 11 '12 at 12:53
  • i removed it but same result. I'm expecting over 6 million. – xendi Jun 11 '12 at 13:16
  • @xendi, How many people (max realistic) would be going to a single camp. I WOULD alter the index to be reversed on the sent table to (UserName, CampID ). I would NOT do a sub-select if a single camp has about 4 million names... the (CampID, userName) I suspect is the culprit of the slow join. vs UserName, CampID (new index attempt) – DRapp Jun 11 '12 at 13:31
  • eventually each camp will have millions of rows. FYI the pool table is a list of usernames to send messages to and the sent table is being used to keep track of who has already been sent a message per each sending campaign. – xendi Jun 11 '12 at 13:34
  • im super stuck. there must be a way – xendi Jun 11 '12 at 14:31
  • @xendi -- super stuck in what way... returning millions of rows will take time. Not trying to respond in a negative tone, but what are you expecting... – DRapp Jun 11 '12 at 14:49
  • well, SELECT * FROM pool WHERE gender = 'f' takes 12 seconds – xendi Jun 11 '12 at 15:01
  • @xendi: Can you add the tables' definitions (in the question or in pastebin)? – ypercubeᵀᴹ Jun 11 '12 at 21:15