3

I have a fairly large flat file (CSV) that I am trying to import into a SQL Server table using an SSIS package. There is nothing special about it; it is a plain import. The problem is that more than 50% of the lines are duplicates.

Example data:

Item Number | Item Name    | Update Date
ITEM-01     | First Item   | 1-Jan-2013
ITEM-01     | First Item   | 5-Jan-2013
ITEM-24     | Another Item | 12-Mar-2012
ITEM-24     | Another Item | 13-Mar-2012
ITEM-24     | Another Item | 14-Mar-2012

Now I need to create my master item record table from this data. As you can see, the rows are duplicated, differing only in Update Date. The file is guaranteed to always be sorted by Item Number, so all I need to do is check: if the next item number = previous item number, then do NOT import this line.

I used a Sort transformation with the remove-duplicates option in the SSIS package, but it actually tries to sort all the lines, which is pointless because the lines are already sorted. On top of that, sorting that many lines takes forever.

So is there any other way?

sallushan

2 Answers

5

There are a couple of approaches you can take to do this.

1. Aggregate Transformation

Group by Item Number and Item Name, and then perform an aggregate operation on Update Date. Based on the logic you mentioned above, the Minimum operation should work. To use Minimum, you'll first need to convert the Update Date column to a date (Minimum can't be performed on a string); that conversion can be done in a Data Conversion Transformation. Below are the guts of what this would look like:

[screenshot: Data Flow with the Data Conversion and Aggregate transformations]
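For reference, the computation this data flow performs is roughly equivalent to the following. This is a minimal C# sketch, not part of the package; the anonymous row type and sample values are made up from the example data above:

    using System;
    using System.Globalization;
    using System.Linq;

    // Illustration of "group by Item Number and Item Name, take the
    // Minimum of Update Date". The row type here is hypothetical.
    var rows = new[]
    {
        new { ItemNumber = "ITEM-01", ItemName = "First Item",   UpdateDate = "1-Jan-2013" },
        new { ItemNumber = "ITEM-01", ItemName = "First Item",   UpdateDate = "5-Jan-2013" },
        new { ItemNumber = "ITEM-24", ItemName = "Another Item", UpdateDate = "12-Mar-2012" },
    };

    var master = rows
        .GroupBy(r => new { r.ItemNumber, r.ItemName })
        .Select(g => new
        {
            g.Key.ItemNumber,
            g.Key.ItemName,
            // Convert to a date before taking the minimum, as the Data
            // Conversion Transformation does; the minimum of the raw
            // strings would be alphabetical, not chronological.
            UpdateDate = g.Min(r => DateTime.ParseExact(
                r.UpdateDate, "d-MMM-yyyy", CultureInfo.InvariantCulture))
        });

    foreach (var m in master)
        Console.WriteLine($"{m.ItemNumber} | {m.ItemName} | {m.UpdateDate:d-MMM-yyyy}");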

2. Script Component Transformation

Essentially, you could implement the logic you mentioned above:

if next item number = previous item number then do NOT import this line

First, you must configure the Script Component appropriately (the steps below assume that you don't rename the default input and output names):

  1. Select Transformation as the Script Component type
  2. Add the Script Component after the Flat File Source in your Data Flow:

    [screenshot: Data Flow with the Script Component after the Flat File Source]

  3. Double-click the Script Component to open the Script Transformation Editor.
  4. Under Input Columns, select all columns:

    [screenshot: Script Transformation Editor, Input Columns page with all columns selected]

  5. Under Inputs and Outputs, select Output 0 and set the SynchronousInputID property to None (this makes the output asynchronous, so only the rows the script explicitly adds are sent downstream):

    [screenshot: Inputs and Outputs page with SynchronousInputID set to None]

  6. Now manually add columns to Output 0 to match the columns in Input 0 (don't forget to set the data types):

    [screenshot: Output 0 with columns added to match Input 0]

  7. Finally, edit the script. There will be a method named Input0_ProcessInputRow; modify it as shown below, and add a private field named previousItemNumber:
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // The file is sorted by Item Number, so a row is a duplicate
        // exactly when it has the same item number as the previous row.
        if (!Row.ItemNumber.Equals(previousItemNumber))
        {
            // The output is asynchronous (SynchronousInputID = None),
            // so each row to keep must be added and populated explicitly.
            Output0Buffer.AddRow();
            Output0Buffer.ItemName = Row.ItemName;
            Output0Buffer.ItemNumber = Row.ItemNumber;
            Output0Buffer.UpdateDate = Row.UpdateDate;
        }

        previousItemNumber = Row.ItemNumber;
    }

    // Holds the Item Number of the previously processed row.
    private string previousItemNumber = string.Empty;
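The same previous-key comparison also works outside of SSIS if you'd rather pre-clean the file before importing it. A minimal standalone sketch, where the file path, comma delimiter, and header row are assumptions about the real file:

    using System;
    using System.IO;

    // Sketch only: pre-clean the file by keeping the first row of each
    // run of identical item numbers.
    class DedupeFilter
    {
        static void Main()
        {
            string previousItemNumber = null;

            using var reader = new StreamReader(@"C:\data\items.csv");
            using var writer = new StreamWriter(@"C:\data\items_unique.csv");

            // Copy the header row through unchanged.
            writer.WriteLine(reader.ReadLine());

            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // The item number is the first comma-delimited field.
                var itemNumber = line.Split(',')[0].Trim();

                // The file is sorted by Item Number, so keeping only the
                // first row of each run drops the duplicates.
                if (!itemNumber.Equals(previousItemNumber))
                    writer.WriteLine(line);

                previousItemNumber = itemNumber;
            }
        }
    }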
gannaway
  • Thanks gannaway for your answer. The sample data I provided is NOT the actual data; the actual data has too many columns, so the 1st approach is NOT applicable. I'll try to work with the 2nd approach and post updates. – sallushan Dec 29 '13 at 19:45
  • Thinking about this more, you could go a step further and do a custom Script Component Source that only reads the unique records into the pipeline (the Script Component Transformation approach above reads all records into the pipeline and then filters them out). It would be a bit more work, as the developer is responsible for writing the code that reads the contents of the file. I'm also not sure how much of a performance benefit it would provide (if any), but it's another option regardless. – gannaway Dec 30 '13 at 16:22
  • Yes, I think I have to go with custom code, either in SSIS or a custom C# tool, because the file also has some data errors that prevent it from working (completely) with the Flat File Source. I probably need to clean the file before handing it to the import process. Thanks gannaway for your help. – sallushan Dec 30 '13 at 18:24
0

If performance is a big concern for you, I'd suggest dumping the entire text file into a temporary table on SQL Server and then using a SELECT DISTINCT * to get the desired values.
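A minimal sketch of that approach, assuming hypothetical staging and master tables (dbo.StagingItems, dbo.MasterItems) and connection details. Note it uses GROUP BY with MIN(UpdateDate) rather than a plain SELECT DISTINCT *, since in this data the duplicate rows differ in Update Date:

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Globalization;
    using Microsoft.VisualBasic.FileIO; // TextFieldParser

    class StagingImport
    {
        static void Main()
        {
            // Read the CSV into an in-memory table. File path, column
            // order, and date format are assumptions about the real file.
            var table = new DataTable();
            table.Columns.Add("ItemNumber", typeof(string));
            table.Columns.Add("ItemName", typeof(string));
            table.Columns.Add("UpdateDate", typeof(DateTime));

            using (var parser = new TextFieldParser(@"C:\data\items.csv"))
            {
                parser.SetDelimiters(",");
                parser.ReadLine(); // skip the header row
                while (!parser.EndOfData)
                {
                    var f = parser.ReadFields();
                    table.Rows.Add(f[0], f[1], DateTime.ParseExact(
                        f[2], "d-MMM-yyyy", CultureInfo.InvariantCulture));
                }
            }

            using (var conn = new SqlConnection(
                "Server=.;Database=Import;Integrated Security=true"))
            {
                conn.Open();

                // Bulk-load everything into the staging table.
                using (var bulk = new SqlBulkCopy(conn)
                    { DestinationTableName = "dbo.StagingItems" })
                {
                    bulk.WriteToServer(table);
                }

                // One row per item: the earliest Update Date wins.
                new SqlCommand(
                    @"INSERT INTO dbo.MasterItems (ItemNumber, ItemName, UpdateDate)
                      SELECT ItemNumber, ItemName, MIN(UpdateDate)
                      FROM dbo.StagingItems
                      GROUP BY ItemNumber, ItemName;", conn).ExecuteNonQuery();
            }
        }
    }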

Eli Ekstein
  • Thanks Eli, the problem is data is too big, if I am going to import the whole data, it is going to take time plus it is going to dramatically increase the database size. So this is NOT a good idea for us. – sallushan Dec 30 '13 at 11:30
  • @sallushan perhaps a temporary db? – Eli Ekstein Jan 05 '14 at 20:46