0

My db tables are growing very fast (and will continue), at this time I have a problem with this query (well, others too):

select user_id from post where user_id not in (select id from user)

What I need is the new ids that are in post table and there are not in user table.

Here is the explain:

> mysql> explain select user_id from post where user_id not in (select
> id from user);
> 

+----+--------------------+-------+-----------------+---------------+---------+---------+------+----------+-------------+ 
| id | select_type        | table | type            | possible_keys | key     | key_len | ref  |rows      | Extra       |
+----+--------------------+-------+-----------------+---------------+---------+---------+------+----------+-------------+ 
|  1 | PRIMARY            | post  | ALL             | NULL          |NULL     | NULL    | NULL | 16076920 | Using where | 
|  2 | DEPENDENT SUBQUERY | user  | unique_subquery | PRIMARY       | PRIMARY | 8       | func |        1 | Using index |
+----+--------------------+-------+-----------------+---------------+---------+---------+------+----------+-------------+

I have tried this other too:

SELECT p.user_id FROM post p LEFT JOIN user u ON p.user_id=u.id WHERE u.id IS NULL;

The explain:

mysql> EXPLAIN SELECT p.user_id FROM post p LEFT JOIN user u ON p.user_id=u.id WHERE u.id IS NULL; 
+----+-------------+-------+--------+---------------+---------+---------+-----------------+----------+--------------------------------------+ 
| id | select_type | table | type   | possible_keys | key     | key_len | ref             | rows     | Extra                                |
+----+-------------+-------+--------+---------------+---------+---------+-----------------+----------+--------------------------------------+
|  1 | SIMPLE      | p     | ALL    | NULL          | NULL    | NULL    | NULL            | 14323335 |                                      |
|  1 | SIMPLE      | u     | eq_ref | PRIMARY       | PRIMARY | 8       | ghost.p.user_id |        1 | Using where; Using index; Not exists |
+----+-------------+-------+--------+---------------+---------+---------+-----------------+----------+--------------------------------------+

Both queries have to see the entire table post, and it is very huge: post table: 16077899 entries user table: 9657158 entries

The query would take several minutes (more than 30min) to perform, any tip ?

Thanks!

Kickstart
  • 21,403
  • 2
  • 21
  • 33
brb
  • 109
  • 1
  • 9
  • 1
    Your second query should return a lot of null rows, since u.id will only show up if it's null. Maybe you want p.user_id instead. Does it run as slow as the first query? – Joachim Isaksson Jun 17 '14 at 10:35
  • Can you add the table declares please? – Kickstart Jun 17 '14 at 10:35
  • Joachim, you are right, I have changed the second query. – brb Jun 17 '14 at 12:21
  • Kickstart, what do you mean by "table declares" the show create table ? – brb Jun 17 '14 at 12:28
  • Yes, just to see indexes, etc. However as your query would appear to return ~6 million rows minimum and you have nothing to narrow down the posts returned I think it will take quite a while to run – Kickstart Jun 17 '14 at 12:40
  • what about using the `not exists` syntax such as here. http://stackoverflow.com/questions/14190788/subqueries-with-exists-vs-in-mysql and is there an index on p.user_ID or has it been setup as a foreign key to u.user_ID? – xQbert Jun 17 '14 at 12:51
  • no indexes or foreign key. I'm considereing to add it, but I don't know much it will take to create it with 16millions rows. – brb Jun 17 '14 at 13:26
  • I just check not exists and It looks the same, so It looks like it is not a query optimization problem but a bad table plan. I will have to figure out how to fix it. – brb Jun 17 '14 at 13:35

2 Answers2

0

I am no expert in database, but your first query explain didn't use a Index on userid, do you have any index built on user_id field in post table? If not just create that. Also, you can try group by/distinct to filter your users, as first query will otherwise return multiple userid from post. This increase speed in all.

Sumit Gupta
  • 2,152
  • 4
  • 29
  • 46
0

I think you should do two things...

  1. Make sure you have and index on post (user_id)
  2. Add the DISTINCT keyword to your query, like this:

    SELECT DISTINCT user_id FROM post WHERE user_id NOT IN (SELECT id FROM user)

Check the new EXPLAIN PLAN.

Frazz
  • 2,995
  • 2
  • 19
  • 33
  • adding the index, it is going to take a while... there are 14M rows. – brb Jun 17 '14 at 13:36
  • You are asking about an optimization... adding an index may well be worth it. Be patient ;) – Frazz Jun 17 '14 at 13:39
  • I don't think the index will help for this query (might well for other queries on the table). It is still the primary table in the join (so no record is being searched for on that table to match a value) and there is nothing being checked for that value that will narrow down the query. – Kickstart Jun 17 '14 at 16:42