1

I have three tables jcxx_device_check_log,sys_depart and jcxx_device_info

the jcxx_device_check_log contains 640000 data volume,

the sys_depart contains 260 data volume

the jcxx_device_info contains 840 data volume

the exists sql

EXPLAIN
format = Json
SELECT state,check_time,note,device_id,COUNT(1) FROM jcxx_device_check_log jc
WHERE  exists (
SELECT jdi.id FROM sys_depart sd
INNER JOIN jcxx_device_info jdi on sd.id=jdi.sys_depart_id
where jc.device_id = jdi.id
) and check_time BETWEEN '2023-4-18 10:00:00' AND ' 2023-4-18 18:00:00'
GROUP BY device_id,state
ORDER BY device_id,state

the EXPLAIN format = Json results

"cost_info": { "query_cost": "453972.80" },

the in sql

EXPLAIN
format = Json
SELECT state,check_time,device_id,COUNT(1) FROM jcxx_device_check_log jc
WHERE device_id in (
SELECT jdi.id FROM sys_depart sd
INNER JOIN jcxx_device_info jdi on jdi.sys_depart_id=sd.id
) and check_time BETWEEN '2023-4-18 10:00:00' AND ' 2023-4-18 18:00:00'
GROUP BY device_id,state
ORDER BY device_id,state;

the EXPLAIN format = Json results

"cost_info": { "query_cost": "2025375.58" },

this is my guess

the result set of exists sql does not exist in the memory and only returns true and false.

this is my question

I want to know whether my guess is correct and how to choose a statement next time I encounter a large amount of data

Paul Maxwell
  • 33,002
  • 3
  • 32
  • 51

1 Answers1

2

There is no single simple answer to "how to choose a statement next time" as the conditions of different tables/indices/statistics etc. will all influence the outcome. Furthermore, there is a lot of myth and legend around the topics of IN and EXISTS which generally result in a rule of thumb like this: "*The general rule of thumb is that if the subquery contains a large volume of data, the EXISTS operator provides a better performance."

However such rules of thumb are difficult (if not impossible) to prove and yet again may not be true if conditions are suited to use of IN() (e.g. if the in lists an indexed column and the subquery itself is inherently efficient). So be wary of such "rules" and as another prior answer states it:

The allegedly performance and technical differences between EXISTS and IN may result from specific vendor's implementations/limitations/bugs, but many times they are nothing but myths created due to lack of understanding of the databases internals.

You are correct about the YES/NO nature of using EXISTS, it merely tests if a match can be found, and if found returns TRUE, otherwise FALSE. However we cannot know if the subquery used is in memory or not.

For knowing when to choose IN or EXISTS, in truth there is no absolute guide and all I can do is list out some guidance:

  1. If the "selective predicates(s)" is/are self-contained to the subquery use IN e.g.
select    t1.* 
from      t1 
where     t1.x in (select t2.x from t2 where t2.y = 'foo')
  1. If the "selective predicates(s)" is/are from the outer table, use EXISTS (refer)e.g.
select    t1.* 
from      t1 
where     exists (select    null 
                  from      t2 
                  where     t2.x=t1.x
                  )

Also note here that an EXISTS does not actually need to return any values for evaluation so you may use select null or select 1 or select * as you prefer. In older variants of some dbs there was a penalty is using *, but this is not the case for most (or all?) dbs today.

  1. Where the relationships between the outer and inner tables are complex use EXISTS (nb: borrowed from this answer)e.g.
select    t1.* 
from      t1 
where     exists (select    null 
                  from      t2 
                  where     t2.x=t1.x 
                        and t2.y>t1.y 
                        and t2.z like '℅' || t1.z || '℅'
                  )

Note that to approach an equivalence of that using IN is so much more cumbersome (and might not be allowed as t1 is referenced 2 tiers down):

select t1.*
from t1
where t1.x in (select t2.x
               from t2
               where t2.x=t1.x 
               and t2.y > (select t1.y from t1 where t1.x = t2.x)
               and t2.z like '%' || (select t1.z from t1 where t1.x = t2.x) || '%'
               )
  1. If seeking the negative of IN i.e. NOT IN be aware that if the subquery returns NULL then the whole result will be NULL (and this can be unexpected) so take steps to exclude NULL from the subquery result> e.g.
select    t1.* 
from      t1 
where     t1.x NOT in (select t2.x from t2 where t2.x IS NOT NULL)
  • nb: the term "selective predicate" is used to mean any predicate(s) that filters out a significant number of rows.

I recommend you also review the many prior answers to this: "Difference between EXISTS and IN in SQL?"

Paul Maxwell
  • 33,002
  • 3
  • 32
  • 51