
What is the most efficient way to get comparison statistics for data from 2 large tables? Is it better to rely on SQL to do the comparisons, or can it be done equally efficiently in C# using LINQ? If we do it in C#, how?

For example, I have two Oracle tables: A and B. A and B have the same columns:

  • Location (string)
  • category (string)
  • new_model (Y/N)
  • item_code (string)

Table A has about 80,000 records; table B has about 20,000 records. The comparisons needed are the following.

For each location:

  1. How many items in A match items in B with the condition of the same location, the same category and the same model?

  2. How many items in A match items in B with the condition of the same location, the same category but different model?

  3. How many items in A are in B but at a different location?

  4. How many items are in A but not in B?

Thank you for your help!

Nicholas Carey
user981848
  • The most efficient way is to do the processing in SQL. This allows you to take advantage of query optimizations and parallel processing. And, it reduces the amount of data being transferred back and forth. If you actually have a question on how to do this, create a question with the table layouts, sample data, and desired results. Do so either by editing this question or deleting this one and writing another. – Gordon Linoff May 01 '14 at 18:55
  • possible duplicate of [LINQ-to-SQL vs stored procedures?](http://stackoverflow.com/questions/14530/linq-to-sql-vs-stored-procedures) – Yurii May 01 '14 at 18:55
  • Thank you for your input. I basically want to confirm whether SQL is the best choice. Adding a twist to this question: these two tables are in two different databases, so doing it in SQL requires creating a database link. Is SQL still the best choice? Appreciate your input! – user981848 May 02 '14 at 15:44

1 Answer


Let the SQL engine do what it's designed to do.

The first two questions are easy. Given a schema like this:

create table foo
(
  location  varchar(200) ,
  category  varchar(32)  ,
  new_model char(1)      check( new_model in ('Y','N') ) ,
  item_code varchar(32)
)

create table bar
(
  location  varchar(200) ,
  category  varchar(32)  ,
  new_model char(1)      check( new_model in ('Y','N') ) ,
  item_code varchar(32)
)

Two almost identical queries will do the trick:

select a.location , count(*) as question_1
from foo a
join bar b on b.location  = a.location
          and b.category  = a.category
          and b.new_model = a.new_model
group by a.location
order by a.location

select a.location , count(*) as question_2
from foo a
join bar b on b.location   = a.location
          and b.category   = a.category
          and b.new_model != a.new_model
group by a.location
order by a.location

Assuming reasonable indexing, performance should be good.
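"Reasonable indexing" here means covering the join columns. As a sketch (the index names are illustrative, not prescribed), a composite index on each table's join key would let the engine resolve both queries without full scans of `bar`:

```sql
create index foo_loc_cat_model_ix on foo ( location , category , new_model )
create index bar_loc_cat_model_ix on bar ( location , category , new_model )
```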

Given the information you've given us, the last two questions

  1. How many items in A are in B but at a different location?
  2. How many items are in A but not in B?

can't be answered as we don't have a definition of what uniquely identifies the rows in each table.
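A follow-up comment below states that `item_code` identifies an item. Under that assumption (and it is only an assumption here), the last two questions might be sketched like this:

```sql
-- Question 3: items in A whose item_code appears in B,
-- but never at the same location (assumes item_code is the item key)
select a.location , count(*) as question_3
from foo a
where exists ( select * from bar b
               where b.item_code = a.item_code )
  and not exists ( select * from bar b
                   where b.item_code = a.item_code
                     and b.location  = a.location )
group by a.location
order by a.location

-- Question 4: items in A whose item_code never appears in B
select a.location , count(*) as question_4
from foo a
where not exists ( select * from bar b
                   where b.item_code = a.item_code )
group by a.location
order by a.location
```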

Nicholas Carey
  • Thank you for your answer. If doing it in SQL is the best choice, I know how to write the queries. Basically, I want to confirm that SQL is the unarguable choice. To be clear, these two tables are in two different databases, so if SQL is the choice, we need to create a database link between them. Adding this twist, do you still recommend doing it in SQL? – user981848 May 02 '14 at 15:33
  • For the last two questions, the item_code is the identifier of the item. – user981848 May 02 '14 at 15:36