
I have a problem where I have to find people who have an old account with an outstanding balance but have also created a new account. I need to match them by comparing SSNs. The complication is that we have primary and additional contacts, so there are two potential SSNs per account, and I need to match them even if they were the primary contact at first but are now the secondary, etc.

Here was my first attempt. I'm just counting for now to get the joins and conditions down; I'll select the actual data later. Basically, the personal table is joined once to the active accounts and a second time to the delinquent accounts, and the two references to personal are compared on the four possible ways the SSNs could be related.

select count(*)
from personal pa
join consumer c                    -- active accounts
  on c.cust_nbr = pa.cust_nbr
 and c.per_acct = pa.acct
join personal pu                   -- second copy of personal
  on pu.ssn = pa.ssn               -- the four possible SSN pairings
  or pu.ssn = pa.addl_ssn
  or pu.addl_ssn = pa.ssn
  or pu.addl_ssn = pa.addl_ssn
join uncol_acct u                  -- delinquent accounts
  on u.cust_nbr = pu.cust_nbr
 and u.per_acct = pu.acct
where u.curr_bal > 0

This works, but it takes 20 minutes to run. I found the question Is having an 'OR' in an INNER JOIN condition a bad idea?, so I tried rewriting it as 4 queries (one per SSN combination) and unioning them. That took 30 minutes to run.
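
For reference, the union version looked roughly like this, one select per SSN pairing (the derived-table alias m is just illustrative):

select count(*)
from (
    select pa.cust_nbr, pa.acct    -- pairing 1: primary matches primary
    from personal pa
    join consumer c on c.cust_nbr = pa.cust_nbr and c.per_acct = pa.acct
    join personal pu on pu.ssn = pa.ssn
    join uncol_acct u on u.cust_nbr = pu.cust_nbr and u.per_acct = pu.acct
    where u.curr_bal > 0
    union all
    select pa.cust_nbr, pa.acct    -- pairing 2: primary matches additional
    from personal pa
    join consumer c on c.cust_nbr = pa.cust_nbr and c.per_acct = pa.acct
    join personal pu on pu.ssn = pa.addl_ssn
    join uncol_acct u on u.cust_nbr = pu.cust_nbr and u.per_acct = pu.acct
    where u.curr_bal > 0
    union all
    select pa.cust_nbr, pa.acct    -- pairing 3: additional matches primary
    from personal pa
    join consumer c on c.cust_nbr = pa.cust_nbr and c.per_acct = pa.acct
    join personal pu on pu.addl_ssn = pa.ssn
    join uncol_acct u on u.cust_nbr = pu.cust_nbr and u.per_acct = pu.acct
    where u.curr_bal > 0
    union all
    select pa.cust_nbr, pa.acct    -- pairing 4: additional matches additional
    from personal pa
    join consumer c on c.cust_nbr = pa.cust_nbr and c.per_acct = pa.acct
    join personal pu on pu.addl_ssn = pa.addl_ssn
    join uncol_acct u on u.cust_nbr = pu.cust_nbr and u.per_acct = pu.acct
    where u.curr_bal > 0
) m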

Is there a better way to do this, or is it just a really inefficient process no matter how you do it?

Update: After playing with some of the options here and some other experimenting, I think I found the problem. Our software vendor encrypts the SSNs in the database and provides a view that decrypts them. Since I have to work from that view, it takes a really long time to decrypt and then compare.
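
If that is the culprit, one thing I may try next (a sketch only, untested; I'm assuming SQL Server-style temp tables here, and personal above is really the vendor's decrypting view):

-- decrypt every SSN exactly once into a temp table
select cust_nbr, acct, ssn, addl_ssn
into #personal_plain
from personal;

-- then run the original query against the decrypted copy
select count(*)
from #personal_plain pa
join consumer c on c.cust_nbr = pa.cust_nbr and c.per_acct = pa.acct
join #personal_plain pu
  on pu.ssn = pa.ssn
  or pu.ssn = pa.addl_ssn
  or pu.addl_ssn = pa.ssn
  or pu.addl_ssn = pa.addl_ssn
join uncol_acct u on u.cust_nbr = pu.cust_nbr and u.per_acct = pu.acct
where u.curr_bal > 0;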

  • How are the tables indexed? You might get a significant speed-up if you index by ssn and addl_ssn (separate indexes for each) – SWeko Aug 15 '13 at 12:39
  • The database is managed by our software vendor. I'm not sure how indexing is setup or if I could have them change that. I could check into that. – Mike Aug 15 '13 at 12:52

1 Answer


If you run the four joins separately and then union them, you might have a problem: what if the same record pair fulfils more than one of the conditions? With union all you will have duplicates in your result (a plain union removes them, at the cost of an extra deduplication step).
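
A minimal illustration of the difference (using from-less selects, which most dialects accept):

select 1 as x union     select 1 as x;  -- 1 row: union removes duplicates
select 1 as x union all select 1 as x;  -- 2 rows: union all keeps them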

I believe your first approach is feasible, but do not forget that you are joining four tables. If the numbers of rows are A, B, C and D in the respective tables, then in the worst case the RDBMS has to check A * B * C * D combinations. If you have many records in your database, this will take a lot of time.

Of course, you can optimize your query by adding indexes to some columns, and that would be a good idea if they are not indexed already. But do not forget that an index makes reads on that column quicker and writes slower. If your operations are mostly reads (selects), then indexing your columns is worthwhile, but do not do it blindly; study indexing a bit before you start.
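
For example (hypothetical DDL, and I am assuming you are even allowed to touch the vendor-managed schema; the index names are just illustrative):

create index ix_personal_ssn      on personal (ssn);
create index ix_personal_addl_ssn on personal (addl_ssn);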

Also, since you are joining four tables, personal, consumer, personal (again) and uncol_acct, you might do something like this:

Write a query that contains two subqueries, named t1 and t2. The first subquery (t1) joins personal to consumer. The second subquery (t2) joins the second occurrence of personal to uncol_acct and moves the where clause inside it. The main query then joins t1 to t2 on the SSN conditions. This way you optimize, because the main join only has to pair up the already-filtered rows of t1 and t2.
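
A sketch of that rewrite, reusing the table and column names from your query (untested, so treat it as a starting point):

select count(*)
from (
    -- t1: people on active accounts
    select pa.ssn, pa.addl_ssn
    from personal pa
    join consumer c
      on c.cust_nbr = pa.cust_nbr
     and c.per_acct = pa.acct
) t1
join (
    -- t2: people on delinquent accounts, filtered right here
    select pu.ssn, pu.addl_ssn
    from personal pu
    join uncol_acct u
      on u.cust_nbr = pu.cust_nbr
     and u.per_acct = pu.acct
    where u.curr_bal > 0
) t2
  on t2.ssn = t1.ssn
  or t2.ssn = t1.addl_ssn
  or t2.addl_ssn = t1.ssn
  or t2.addl_ssn = t1.addl_ssn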

Also, if your where clause stays outside, as in your example query, then the four-way join may be executed first and the where applied only afterwards. This is why the where clause should go inside the second subquery, so that the filter runs before the main join. If the condition is fulfilled rarely, you can even nest a further subquery inside the second one to apply the filter as early as possible.
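
For example, the t2 subquery from the sketch above would become (a fragment, with the same caveats):

join (
    select pu.ssn, pu.addl_ssn
    from personal pu
    join (
        -- filtered before any join touches the rest of the query
        select cust_nbr, per_acct
        from uncol_acct
        where curr_bal > 0
    ) u
      on u.cust_nbr = pu.cust_nbr
     and u.per_acct = pu.acct
) t2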

Cheers!

Lajos Arpad
  • Indexing is handled by our software vendor and I don't know if I could have them change it or not. This is a rare procedure and not time sensitive so it's probably not worth that. I like the 2 sub-queries idea, I'll see if it's any faster. As for duplicate records, I don't think it's happening because my counts are the same for both methods. Also it wouldn't be terribly important in this case because someone will be going over the list to try to track down the people. If they show up twice it's ok. – Mike Aug 15 '13 at 13:00
  • Duplicate records will occur with the union approach if and only if there is at least one pair for which more than one of the join conditions is fulfilled. You can prevent this by using the distinct keyword. However, the solution proposed in my answer prevents this, because the queries are not partitioned; only the joins are re-formatted using sub-queries to speed things up. – Lajos Arpad Aug 15 '13 at 13:51
  • I tried the sub-query method and it took almost exactly as long as the method without sub-queries. Less than a second difference, actually. I think I discovered the problem though. Our software vendor encrypts the SSNs. We get a table view that's unencrypted, so it has to run a decryption routine on every SSN, and that takes a really long time. – Mike Aug 15 '13 at 14:58
  • The optimisation will only have effect if you have a lot of records in your tables. Can you specify how many records you have in each of the four tables? – Lajos Arpad Aug 15 '13 at 18:47
  • About 70,000 in personal, 60,000 in consumer, 14,000 in uncollectable. – Mike Aug 16 '13 at 12:33
  • That's not too much; speed problems due to queries begin to occur when you have millions, or at least hundreds of thousands, of rows in the tables you are joining. – Lajos Arpad Aug 16 '13 at 13:35