I am using Delta Lake (OSS, version 0.7.0, with PySpark 3.0.1), and the table is modified by a MERGE every 5 minutes from a micro-batch PySpark script.
On the first run, the merge created 18 small files (numTargetRowsInserted -> 32560). I then reran it with the same data. Although nothing in the data had changed, the table was touched and its version was updated: the number of small files increased to 400, and the 18 previously added files were marked as removed. Every MERGE after the first one reports numTargetRowsCopied -> 32560 in the operationMetrics. Why are the target rows copied again, and why are the older files marked as removed? Am I missing anything?
The operationMetrics data is shown below:
operationMetrics |
[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 18, executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 68457, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 400, rewriteTimeMs -> 66410]|
[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 400, executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 16838, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 18, rewriteTimeMs -> 48810]|
[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 18, executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 12399, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 18, rewriteTimeMs -> 15039] |
[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 18, executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 12244, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 18, rewriteTimeMs -> 14828] |
[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 18, executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 67154, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 400, rewriteTimeMs -> 70194]|
[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 400, executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 20367, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 18, rewriteTimeMs -> 80719]|
[numTargetRowsCopied -> 0, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 18, executionTimeMs -> 0, numTargetRowsInserted -> 32560, scanTimeMs -> 7035, numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, numTargetFilesRemoved -> 0, rewriteTimeMs -> 11606] |
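To make the pattern in these rows easier to see, here is a minimal plain-Python sketch that parses two of the printed lines and flags commits that replaced files without changing any rows. The helper names `parse_metrics` and `is_pure_rewrite` are made up for illustration and are not part of Delta Lake. I am also assuming the rows above are listed newest first, since the last row is the only one that reports inserts.

```python
import re


def parse_metrics(line):
    """Parse '[key -> 123, other -> 456]' as printed above into a dict of ints."""
    pairs = re.findall(r"(\w+)\s*->\s*(\d+)", line)
    return {key: int(value) for key, value in pairs}


def is_pure_rewrite(metrics):
    """True when a commit removed files but inserted, updated, and deleted no rows."""
    changed = (
        metrics.get("numTargetRowsInserted", 0)
        + metrics.get("numTargetRowsUpdated", 0)
        + metrics.get("numTargetRowsDeleted", 0)
    )
    return changed == 0 and metrics.get("numTargetFilesRemoved", 0) > 0


# Last row of the output above (assumed to be the first MERGE).
first_merge = (
    "[numTargetRowsCopied -> 0, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 18, "
    "executionTimeMs -> 0, numTargetRowsInserted -> 32560, scanTimeMs -> 7035, "
    "numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, "
    "numTargetFilesRemoved -> 0, rewriteTimeMs -> 11606]"
)

# Second-to-last row (assumed to be the rerun with the same data).
second_merge = (
    "[numTargetRowsCopied -> 32560, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 400, "
    "executionTimeMs -> 0, numTargetRowsInserted -> 0, scanTimeMs -> 20367, "
    "numTargetRowsUpdated -> 0, numOutputRows -> 32560, numSourceRows -> 32560, "
    "numTargetFilesRemoved -> 18, rewriteTimeMs -> 80719]"
)

for label, line in [("first", first_merge), ("second", second_merge)]:
    metrics = parse_metrics(line)
    print(label, "pure rewrite:", is_pure_rewrite(metrics))
```

With these two rows, only the second one is flagged: it changed no rows, yet 18 files were removed and 400 were added.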
Merge SQL:
MERGE INTO Target_table tgt
USING Source_table src
ON src.pk_col = tgt.pk_col
WHEN MATCHED AND src.operation=="DELETE" THEN DELETE
WHEN MATCHED AND src.operation=="UPDATE" THEN UPDATE SET *
WHEN NOT MATCHED AND src.operation!="DELETE" THEN INSERT *
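
To spell out what the three clauses do with a single source row, here is a small plain-Python sketch of the clause logic. It is only an illustration of the SQL above, not Delta's implementation; `merge_action` is a made-up helper, and the operation value "INSERT" is an assumption (the statement only names "DELETE" and "UPDATE"). It also ignores SQL NULL semantics for `operation`.

```python
def merge_action(operation, matched):
    """Return which clause of the MERGE above applies to one source row.

    operation: value of src.operation for the row.
    matched:   True when src.pk_col equals some tgt.pk_col.
    """
    if matched and operation == "DELETE":
        return "delete"
    if matched and operation == "UPDATE":
        return "update"
    if not matched and operation != "DELETE":
        return "insert"
    # No clause applies: the row is neither deleted, updated, nor inserted.
    return "none"


# First run: the target is empty, so no row is matched.
print(merge_action("INSERT", matched=False))
# Rerun with the same data: every row is now matched on pk_col.
print(merge_action("INSERT", matched=True))
```

Under that assumption, the first call selects the insert clause and the second selects no clause at all, which is at least consistent with the metrics: numTargetRowsInserted, numTargetRowsUpdated, and numTargetRowsDeleted are all 0 on the reruns, while every row still satisfies the ON condition.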