
What is the most efficient way to get comparison statistics for data from 2 large tables? Is it better to rely on SQL to do the comparisons, or can it be done equally efficiently in C# using LINQ? If we do it in C#, how?

For example, I have two Oracle tables: A and B. A and B have the same columns:

  • Location (string)
  • category (string)
  • new_model (Y/N)
  • item_code (string)

Table A has about 80,000 records; table B has about 20,000 records. The comparisons needed are the following.

For each location:

  1. How many items in A match items in B with the condition of the same location, the same category and the same model?

  2. How many items in A match items in B with the condition of the same location, the same category but different model?

  3. How many items in A are in B but at a different location?

  4. How many items are in A but not in B?

Thank you for your help!

Nicholas Carey
user981848
  • The most efficient way is to do the processing in SQL. This allows you to take advantage of query optimizations and parallel processing. And, it reduces the amount of data being transferred back and forth. If you actually have a question on how to do this, create a question with the table layouts, sample data, and desired results. Do so either by editing this question or deleting this one and writing another. – Gordon Linoff May 01 '14 at 18:55
  • possible duplicate of [LINQ-to-SQL vs stored procedures?](http://stackoverflow.com/questions/14530/linq-to-sql-vs-stored-procedures) – Yurii May 01 '14 at 18:55
  • Thank you for your input. I basically want to confirm whether SQL is the best choice. Adding a twist to this question: these two tables are in two different databases, so doing it in SQL requires creating a database link. Is SQL still the best choice? Appreciate your input! – user981848 May 02 '14 at 15:44

1 Answer


Let the SQL engine do what it's designed to do.

The first two questions are easy. Given a schema like this:

create table foo
(
  location  varchar(200) ,
  category  varchar(32)  ,
  new_model char(1)      check( new_model in ('Y','N') ) ,
  item_code varchar(32)
)

create table bar
(
  location  varchar(200) ,
  category  varchar(32)  ,
  new_model char(1)      check( new_model in ('Y','N') ) ,
  item_code varchar(32)
)

Two almost identical queries will do the trick:

select a.location , count(*) as question_1
from foo a
join bar b on b.location  = a.location
          and b.category  = a.category
          and b.new_model = a.new_model
group by a.location
order by a.location

select a.location , count(*) as question_2
from foo a
join bar b on b.location   = a.location
          and b.category   = a.category
          and b.new_model != a.new_model
group by a.location
order by a.location

Assuming reasonable indexing, performance should be good.
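"Reasonable indexing" here means covering the join columns. As a sketch (the index names are illustrative, not prescribed), a composite index on each table's join key would let the engine resolve both queries without full scans of `bar`:

```sql
create index foo_loc_cat_model_ix on foo ( location , category , new_model )
create index bar_loc_cat_model_ix on bar ( location , category , new_model )
```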

Given the information you've given us, the last two questions

  1. How many items in A are in B but at a different location?
  2. How many items are in A but not in B?

can't be answered as we don't have a definition of what uniquely identifies the rows in each table.
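A follow-up comment below states that `item_code` identifies an item. Under that assumption (and it is only an assumption here), the last two questions might be sketched like this:

```sql
-- Question 3: items in A whose item_code appears in B,
-- but never at the same location (assumes item_code is the item key)
select a.location , count(*) as question_3
from foo a
where exists ( select * from bar b
               where b.item_code = a.item_code )
  and not exists ( select * from bar b
                   where b.item_code = a.item_code
                     and b.location  = a.location )
group by a.location
order by a.location

-- Question 4: items in A whose item_code never appears in B
select a.location , count(*) as question_4
from foo a
where not exists ( select * from bar b
                   where b.item_code = a.item_code )
group by a.location
order by a.location
```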

Nicholas Carey
  • Thank you for your answer. If doing it in SQL is the best choice, I know how to write the queries. Basically, I want to confirm that SQL is the unarguable choice. To be clear, these two tables are in two different databases, so if SQL is the choice, we need to create a database link between them. Adding this twist, do you still recommend doing it in SQL? – user981848 May 02 '14 at 15:33
  • For the last two questions, the item_code is the identifier of the item. – user981848 May 02 '14 at 15:36