1

My query is quite simple:

select  a.ID, a.adres, a.place, a.postalcode  
from COMPANIES a, COMPANIES b  
where a.Postcode = b.Postcode  
and a.Adres = b.Adres  
and (  
select COUNT(COMPANYID)  
from USERS  
where COMPANYID=a.ID  
)>(  
select COUNT(COMPANYID)  
from USERS  
where COMPANYID=b.ID  
)

Database: sql server 2008 r2

What I'm trying to do: The table of COMPANIES contains double entries. I want to know the ones that are connected to the most amount of users. So I only have to change the foreign keys of those with the least. ( I already know the id's of the doubles)

Right now it's taking a lot of time to complete. I was wondering if if could be done faster

Wouter Verleur
  • 61
  • 2
  • 14
  • 2
    Do you have indexes defined on the join and where clause columns? – Oded Apr 06 '12 at 08:25
  • Yes, The ID is the primary key of the companies and the companyid is indexed because we use the connection quite often to search for users. – Wouter Verleur Apr 06 '12 at 08:27
  • And the `Postcode` and `Adres` fields on the self join? – Oded Apr 06 '12 at 08:28
  • postcode yes (nvarchar) adres no (nvarchar) – Wouter Verleur Apr 06 '12 at 08:29
  • 2
    That may be why you are getting a slow query. You should do a proper `JOIN` clause, by the way, instead of a cross join with `WHERE` (more of a stylistic issue than performance). – Oded Apr 06 '12 at 08:32
  • the same query without the connection to users (just to find the id's of the doubles) is fast. The join did do a little performance upgrade: Instead of running for 2 minutes it now takes 1 minute 50 seconds on a small version of the tables. Is this because of the missing index? BTW, Thanks for the help. – Wouter Verleur Apr 06 '12 at 08:38

3 Answers3

3

Try this version. It should be only a little faster. The COUNT is quite slow. I've added a.ID <> b.ID to avoid few cases earlier.

select  a.ID, a.adres, a.place, a.postalcode  
from COMPANIES a INNER JOIN COMPANIES b
ON
a.ID <> b.ID
and a.Postcode = b.Postcode  
and a.Adres = b.Adres  
and (  
select COUNT(COMPANYID)  
from USERS  
where COMPANYID=a.ID  
)>(  
select COUNT(COMPANYID)  
from USERS  
where COMPANYID=b.ID  
)

The FROM ... INNER JOIN ... ON ... is a preferred SQL construct to join tables. It may be faster too.

Michas
  • 8,534
  • 6
  • 38
  • 62
  • A little faster? The time for my testset was shortened from 2 minutes to 9 second. A saving of 92,5%. Testing it on the full version now.. _Just to inform you: You're missing an `and` between `b.ID` and `a.postcode`_ – Wouter Verleur Apr 06 '12 at 09:16
  • 2
    @WouterVerleur It is addition of "a.ID <> b.ID" resulting in discarding self-matches and consequently in a lot less counting. – Nikola Markovinović Apr 06 '12 at 09:19
  • As per @NikolaMarkovinović, its the ID check that's speeding the query up but I'd re-iterate the advise to use the inner join ... on rather than where - it makes your queries clearer (the where clause is then about filtering the data and the joins between tables are separate) – kaj Apr 06 '12 at 09:26
0

One approach would be to pre-calculate the COMPANYID count before doing the join since you'll be repeatedly calculating it in the main query. i.e. something like:

insert into @CompanyCount (ID, IDCount)
select COMPANYID, COUNT(COMPANYID)
from USERS
group by COMPANYID

Then your main query:

select a.ID, a.adres, a.place, a.postalcode
from COMPANIES a
  inner join @CompanyCount aCount on aCount.ID = a.ID
  inner join COMPANIES b on b.Postcode = a.Postcode and b.Adres = a.Adres
  inner join @CompanyCount bCount on bCount.ID = b.ID and aCount.IDCount > bCount.IDCount

If you want all instances of a even though there is no corresponding b then you'd need to have left outer joins to b and bCount.

However you need to look at the query plan - which indexes are you using - you probably want to have them on the IDs and the Postcode and Adres fields as a minimum since you're joining on them.

kaj
  • 5,133
  • 2
  • 21
  • 18
  • Tried: Gives me only 2 rows and the original gives me 34 rows, though nothing seems to be missing. I only want to compare `a` that has a counterpart `b` (only doubles). – Wouter Verleur Apr 06 '12 at 09:06
  • Since @Michas has got you a working answer (I've upvoted it), I'll not pursue this - I'd need more info on your data probably - this was written without my testing it, it works in theory :-) One thing I would advise is avoiding columns called "ID" - if they're specific like your CompanyID it reduces confusion in the query – kaj Apr 06 '12 at 09:23
  • To be honest, I try to avoid fieldnames shorter than 5 characters. But I wasn't the one who created the database. And the one that did created the database back in 2000. The work I'm doing is for a migration to a new system (with a completely new database designed by me). Thanks for the advise. I'll keep it in mind. – Wouter Verleur Apr 06 '12 at 09:29
0
  1. Build an index on postcode and adres

  2. The database probably executes the subselects for every row. (Just guessing here, veryfy it in the explain plan. If this is the case you can rewrite the query to join with the inline views (note this is how it would look in oracle hop it works in sql server as well):

    select distinct a.ID, a.adres, a.place, a.postalcode  
    from 
        COMPANIES a, 
        COMPANIES b,  
    (
        select COUNT(COMPANYID) cnt, companyid  
        from USERS
        group by companyid) cntA,  
    (
        select COUNT(COMPANYID) cnt, companyid  
        from USERS
        group by companyid) cntb   
    where a.Postcode = b.Postcode  
    and a.Adres = b.Adres  
    and a.ID<>b.ID
    and cnta.cnt>cntb.cnt
    
Jens Schauder
  • 77,657
  • 34
  • 181
  • 348
  • This does not work in SQL Result: Msg 8120, Level 16, State 1, Line 6 `Column 'USERS.COMPANYID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.` – Wouter Verleur Apr 06 '12 at 09:08
  • Sorry, missed the group by. Added it. Probably contains still one or more typos .. didn't try it on a database – Jens Schauder Apr 06 '12 at 17:19
  • I won't be able to test it until Tuesday, but I'll see if it works then. Both of us might be able to learn from it a bit right? – Wouter Verleur Apr 06 '12 at 21:36
  • This one also works but misses: ` distinct` after the select and. `a.ID<>b.ID and` after the where – Wouter Verleur Apr 10 '12 at 10:20