I am creating a transformation that takes input from a CSV file and outputs it to a table. It runs correctly, but the problem is that if I run the transformation more than once, the output table contains the same duplicate rows again and again.

Now I want to remove all duplicate rows from the output table.

And if I run the transformation repeatedly, it should not affect the output table unless there are new rows.

How can I solve this?

mzy

3 Answers

Two solutions come to my mind:

  1. Use the Insert / Update step instead of the Table output step to store data in the output table. It will try to find a row in the output table that matches the incoming stream row according to the key fields you define (all fields/columns in your case). It works like this:

    • If the row can't be found, it inserts the row. If it can be found and the fields to update are the same, nothing is done. If they are not all the same, the row in the table is updated.

    Use the following parameters:

    • The keys to look up the values: tableField1 = streamField1; tableField2 = streamField2; tableField3 = streamField3; and so on.
    • Update fields: tableField1, streamField1, N; tableField2, streamField2, N; tableField3, streamField3, N; and so on (with N the field is only inserted for new rows, never updated).
  2. After storing duplicate values in the output table, you can remove the duplicates using this approach:

    • Use an Execute SQL script step in which you define SQL that removes the duplicate entries and keeps only unique rows (a sketch follows after this list). For ideas on writing such SQL, see: How can I remove duplicate rows?
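A minimal sketch of such clean-up SQL, assuming a hypothetical table named output_table whose rows should be unique across all columns; the names and exact syntax (CREATE TABLE ... AS, TRUNCATE) vary by database:

```sql
-- Rebuild the table from its distinct rows.
-- output_table / output_table_dedup are placeholder names; adapt to your schema.
CREATE TABLE output_table_dedup AS
SELECT DISTINCT * FROM output_table;

TRUNCATE TABLE output_table;

INSERT INTO output_table
SELECT * FROM output_table_dedup;

DROP TABLE output_table_dedup;
```

You could run this from the Execute SQL script step after the load, or as a separate SQL job entry.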
mzy

Another way is to use the Merge rows (diff) step, followed by a Synchronize after merge step.

As long as the number of rows in your CSV that differ from your target table is below 20-25% of the total, this is usually the most performance-friendly option.

Merge rows (diff) takes two input streams that must be sorted on their key fields (with a compatible collation), and generates the union of the two inputs with each row marked as "new", "changed", "deleted", or "identical". This means you'll have to put Sort rows steps on the CSV input, and possibly on the input from the target table if you can't use an ORDER BY clause. Mark the CSV input as the "Compare" row origin and the target table as the "Reference".
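For example, if the reference side comes from a Table input step, the sort can usually be pushed into the query instead of adding a Sort rows step. The table and columns below are hypothetical; sort on exactly the key fields the merge compares on, with a matching collation:

```sql
-- Hypothetical target table read as the "Reference" stream.
-- Sort on the same key field(s) the Merge rows (diff) step uses.
SELECT id, name, email
FROM target_table
ORDER BY id;
```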

The Synchronize after merge step then applies the changes marked in the rows to the target table. Note that Synchronize after merge is the only step in PDI (I believe) that requires input be entered in the Advanced tab. There you set the flag field and the values that identify the row operation. After applying the changes the target table will contain the exact same data as the input CSV.

Note also that you can use a Switch/Case or Filter Rows step to do things like remove deletes or updates if you want. I often flow off the "identical" rows and write the rest to a text file so I can examine only the changes.

Brian.D.Myers

I looked for visual answers, but the existing answers were text only, so I am adding this visual answer for any Kettle newbie like me.

Case

user-updateslog.csv (has duplicate values) ---> users_table, storing only the latest user details.

Solution

Step 1: Connect the CSV file input to an Insert / Update step as in the transformation below.

[Image: transformation with a CSV file input step connected to an Insert / Update step]

Step 2: In Insert / Update, add the key condition used to find the candidate row, and set "Y" on the fields to update.

Match the record by PK, and update the values.
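Conceptually, this Insert / Update configuration behaves like a SQL upsert keyed on the primary key. The sketch below only illustrates that matching/updating logic; users_table, staging_users, and their columns are hypothetical, and MERGE syntax varies by database:

```sql
-- Illustration only: what the step does for each incoming row.
MERGE INTO users_table t
USING staging_users s
  ON (t.user_id = s.user_id)
WHEN MATCHED THEN
  UPDATE SET t.name = s.name, t.email = s.email
WHEN NOT MATCHED THEN
  INSERT (user_id, name, email)
  VALUES (s.user_id, s.name, s.email);
```

The step itself works row by row on the incoming stream, so no staging table is actually needed.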

Espresso