
I am trying to learn about SSIS and have a few doubts about it.

I want to compare 2 tables. One table is in SQL Server and the other is in Oracle.

Both tables have the same schema, like below:

SQL Server:
Id      Amount
1       100
2       200
3       300


Oracle:
Id      Amount
3       3000
2       2000
1       1000

These are just a few sample records; in reality there are 24 million records, 12 million in the source and 12 million in the target, in some random order.

Task: I am trying to compare source and target data. There will always be a 1-to-1 match between source and target rows when joining on the Id column, and I want to compare the Amount column and store the unmatched records in a SQL Server database, so I know a Lookup transformation could do this.
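To make the intent concrete: if both tables were sitting in the same database, the unmatched rows I want to store would be roughly the result of a query like this (all table names here are only illustrative):

SELECT s.Id, s.Amount AS SourceAmount, t.Amount AS TargetAmount
FROM   SourceTable  AS s            -- the SQL Server table
       INNER JOIN TargetTable AS t  -- the Oracle table, imagined here as local
               ON t.Id = s.Id
WHERE  s.Amount <> t.Amount;        -- keep only rows where the amounts differ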

However, I have some doubts:

1) If I fire a select * query against the source and the target, where will the 24 million records stay? In memory?

2) Can I get a memory exception in this case?

3) As the two result sets are in different orders in source and target, how will the Lookup work? Will it load all the source data and then match records one by one against the target without loading the whole target data?

4) How does SSIS handle comparing millions of rows between source and target?

Can anybody please help me clear up the above doubts?

I Love Stackoverflow
  • What are you going to do with the data? "Compare" isn't really a term in SQL. – Jacob H Feb 06 '18 at 14:29
  • @JacobH I will be storing the unmatched data in my database, but that is secondary. Before that I would like to understand SSIS; while doing research I had some doubts, which I have raised in my question. – I Love Stackoverflow Feb 06 '18 at 14:39
  • Well the answer to your doubts is "it depends" and since you haven't really described what you are trying to do the answer is still "it depends"... there is no such thing as "compare". SSIS can SELECT a million or 24 million rows. SSIS has caching options. The amount of memory used depends on the amount of data queried. Pretty much all the answers are "it depends" unless you have more info for us. – Jacob H Feb 06 '18 at 14:41
  • @JacobH I updated my question; it now shows what I am trying to do. – I Love Stackoverflow Feb 06 '18 at 14:44
  • Agree that you have not given enough information to get a better answer than "it depends", but I will tell you that using a Lookup to compare 12 million rows with 12 million rows is a terrible idea. Merge join would be better, but your best solution would be to import to a staging table and do the comparison there. – Tab Alleman Feb 06 '18 at 14:44
  • @TabAlleman Sorry for the inconvenience, but I have now updated my question with what I am trying to do. Why would a Lookup be a terrible idea? That is exactly what I am trying to understand: how will a Lookup work when comparing 12 million rows to 12 million rows? – I Love Stackoverflow Feb 06 '18 at 14:47

2 Answers


If you do this with a Lookup, neither row set will be stored completely in memory unless you use full cache mode. With full cache, the target (reference) data is loaded into memory up front, and yes, you could get memory exceptions if you don't have enough available memory.

A Lookup is a terrible idea here because, in no-cache or partial-cache mode, for every row in the source data you are going to query the target data. So you are going to issue up to 12 million individual queries against the target before you are done. This is the worst-performing option.
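To picture why that hurts, the per-row probe the Lookup sends to the target is roughly the statement below, fired once for each incoming source row (names are illustrative; SSIS generates the actual statement):

SELECT Amount        -- the column being compared
FROM   TargetTable   -- illustrative name for the reference (Oracle) table
WHERE  Id = ?        -- parameter bound from the current source row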

A Merge Join is faster because both inputs are pre-sorted on the matching key, so the matching itself is cheap. Also, neither dataset needs to be held entirely in memory; the rows flow through without waiting for the whole dataset to be loaded.

Here is a comparison between Lookup and Merge Join.

The fastest option would be to load your target data into a staging table on the same server as your source data and index that table on the joining key. Then you can do the comparison in SQL, joining on the indexed columns, which would give you the best performance.
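A minimal T-SQL sketch of that staging approach, assuming the Oracle rows are landed in a table called dbo.OracleStaging on the same SQL Server instance as the source table (all object names here are illustrative):

-- Staging table for the Oracle data, clustered on the joining key
CREATE TABLE dbo.OracleStaging
(
    Id     INT            NOT NULL PRIMARY KEY CLUSTERED,
    Amount DECIMAL(18, 2) NOT NULL
);

-- (An SSIS data flow copies the 12 million Oracle rows into dbo.OracleStaging here.)

-- The comparison is then a single set-based query over indexed columns
INSERT INTO dbo.UnmatchedRows (Id, SourceAmount, TargetAmount)
SELECT s.Id, s.Amount, o.Amount
FROM   dbo.SourceTable   AS s
       INNER JOIN dbo.OracleStaging AS o
               ON o.Id = s.Id
WHERE  s.Amount <> o.Amount;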

Tab Alleman
  • Upvoted for your kind efforts towards helping me. So how does SSIS perform a comparison of millions of records from source to target without loading the whole data set? Can you please explain this part, as I am failing to understand it. – I Love Stackoverflow Feb 06 '18 at 15:05
  • For a long time I didn't see @TabAlleman writing long answers :) Nice explanation, +1 – Hadi Feb 06 '18 at 19:53
  • Genius idea. I moved just the primary matching column to the same DB to do the comparison, avoiding a left outer join and avoiding bringing over all rows at the same time. – rajquest Jan 29 '19 at 21:31

In addition to Tab's answer, the OP asked: 'how does SSIS perform a comparison of millions of records from source to target without loading the whole data set?'

Answer:

Remember, Merge Join takes sorted input only.

Merge Join walks through the two sets in the order you supplied at the input (or produced with a Sort transformation). It reads one record from the first input and one record from the second input; if the keys match, it outputs a row with information from both inputs, otherwise it advances whichever input has the smaller key. The advantage is that SSIS only needs to retain a couple of rows in memory at a time.
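In practice, the cheapest way to provide that sorted input is to sort at the source with ORDER BY rather than adding a Sort transformation, and then tell SSIS the output is already sorted (for an OLE DB source, set IsSorted = True and SortKeyPosition = 1 on the Id column in the Advanced Editor). Illustrative source queries, with table names assumed:

-- SQL Server source, sorted on the join key
SELECT Id, Amount
FROM   dbo.SourceTable
ORDER  BY Id;

-- Oracle source, sorted on the same key
SELECT Id, Amount
FROM   TARGET_SCHEMA.TARGET_TABLE
ORDER  BY Id;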

What if Microsoft had decided that there was no requirement for sorting? Then, for the Merge to work, it would have to load all of the rows from one input into memory and look up each incoming row against that in-memory set. That means a large amount of memory would be needed.

source: msdn

Prabhat G