I am creating a transformation that takes input from a CSV file and outputs it to a table. It runs correctly, but the problem is that if I run the transformation more than once, the output table contains the same duplicate rows again and again.

Now I want to remove all duplicate rows from the output table.

And if I run the transformation repeatedly, it should not affect the output table unless there are new rows.

How can I solve this?

mzy

3 Answers

Two solutions come to my mind:

  1. Use the Insert / Update step instead of the Table output step to store data in the output table. It will try to find a row in the output table that matches the incoming stream row according to the key fields you define (all fields/columns in your case). It works like this:

    • If the row can't be found, it inserts the row. If it can be found and the fields to update are the same, nothing is done. If they are not all the same, the row in the table is updated.

    Use the following parameters:

    • The keys to look up the values: tableField1 = streamField1; tableField2 = streamField2; tableField3 = streamField3; and so on.
    • Update fields: tableField1, streamField1, N; tableField2, streamField2, N; tableField3, streamField3, N; and so on (with N the field is only inserted for new rows, never updated).
  2. After storing duplicate values in the output table, you can remove the duplicates using this approach:

    • Use an Execute SQL script step in which you define SQL that removes the duplicate entries and keeps only unique rows (a sketch follows after this list). For ideas on writing such SQL, see: How can I remove duplicate rows?
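A minimal sketch of such clean-up SQL, assuming a hypothetical table named output_table whose rows should be unique across all columns; the names and exact syntax (CREATE TABLE ... AS, TRUNCATE) vary by database:

```sql
-- Rebuild the table from its distinct rows.
-- output_table / output_table_dedup are placeholder names; adapt to your schema.
CREATE TABLE output_table_dedup AS
SELECT DISTINCT * FROM output_table;

TRUNCATE TABLE output_table;

INSERT INTO output_table
SELECT * FROM output_table_dedup;

DROP TABLE output_table_dedup;
```

You could run this from the Execute SQL script step after the load, or as a separate SQL job entry.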
mzy

Another way is to use the Merge rows (diff) step, followed by a Synchronize after merge step.

As long as the number of rows in your CSV that differ from your target table is below 20-25% of the total, this is usually the most performance-friendly option.

Merge rows (diff) takes two input streams that must be sorted on their key fields (with a compatible collation), and generates the union of the two inputs with each row marked as "new", "changed", "deleted", or "identical". This means you'll have to put Sort rows steps on the CSV input, and possibly on the input from the target table if you can't use an ORDER BY clause. Mark the CSV input as the "Compare" row origin and the target table as the "Reference".
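For example, if the reference side comes from a Table input step, the sort can usually be pushed into the query instead of adding a Sort rows step. The table and columns below are hypothetical; sort on exactly the key fields the merge compares on, with a matching collation:

```sql
-- Hypothetical target table read as the "Reference" stream.
-- Sort on the same key field(s) the Merge rows (diff) step uses.
SELECT id, name, email
FROM target_table
ORDER BY id;
```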

The Synchronize after merge step then applies the changes marked in the rows to the target table. Note that Synchronize after merge is the only step in PDI (I believe) that requires input be entered in the Advanced tab. There you set the flag field and the values that identify the row operation. After applying the changes the target table will contain the exact same data as the input CSV.

Note also that you can use a Switch/Case or Filter Rows step to do things like remove deletes or updates if you want. I often flow off the "identical" rows and write the rest to a text file so I can examine only the changes.

Brian.D.Myers

I looked for visual answers, but the existing answers were text only, so I am adding this visual answer for any Kettle newbie like me.

Case

user-updateslog.csv (has duplicate values) ---> users_table, storing only the latest user details.

Solution

Step 1: Connect the CSV file input to an Insert / Update step as in the transformation below.

[Image: transformation with a CSV file input step connected to an Insert / Update step]

Step 2: In Insert / Update, add the key condition used to find the candidate row, and set "Y" on the fields to update.

Match the record by PK, and update the values.
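Conceptually, this Insert / Update configuration behaves like a SQL upsert keyed on the primary key. The sketch below only illustrates that matching/updating logic; users_table, staging_users, and their columns are hypothetical, and MERGE syntax varies by database:

```sql
-- Illustration only: what the step does for each incoming row.
MERGE INTO users_table t
USING staging_users s
  ON (t.user_id = s.user_id)
WHEN MATCHED THEN
  UPDATE SET t.name = s.name, t.email = s.email
WHEN NOT MATCHED THEN
  INSERT (user_id, name, email)
  VALUES (s.user_id, s.name, s.email);
```

The step itself works row by row on the incoming stream, so no staging table is actually needed.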

Espresso