Another way is to use the Merge rows (diff)
step, followed by a Synchronize after merge
step.
As long as the number of rows in your CSV that are different from your target table are below 20 - 25% of the total, this is usually the most performance friendly option.
Merge rows (diff)
takes two input streams that must be sorted on its key fields (by a compatible collation), and generates the union of the two inputs with each row marked as "new", "changed", "deleted", or "identical". This means you'll have to put Sort rows
steps on the CSV input and possibly the input from the target table if you can't use an ORDER BY clause. Mark the CSV input as the "Compare" row origin and the target table as the "Reference".
The Synchronize after merge
step then applies the changes marked in the rows to the target table. Note that Synchronize after merge
is the only step in PDI (I believe) that requires input be entered in the Advanced tab. There you set the flag field and the values that identify the row operation. After applying the changes the target table will contain the exact same data as the input CSV.
Note also that you can use a Switch/Case
or Filter Rows
step to do things like remove deletes or updates if you want. I often flow off the "identical" rows and write the rest to a text file so I can examine only the changes.