We have an ELT process that stores data partitioned by Year in a Delta Lake table, processed through Databricks. In Databricks, querying the location returns the data correctly, with no duplicates and no variation in the total count. When I create a view over the same partitioned data in Synapse Serverless, the data comes back with duplicates after an update has been applied to it; when the data is written for the first time there are no issues whatsoever. After troubleshooting, I found that this only happens when a view reads the partitioned data after an update. With an external table that has no partition specified, the results are correct as well.
Delta Lake partitioned data overview
On Databricks, the data is read correctly:
select PKCOLUMNS, count(*) from mytable group by PKCOLUMNS having count(*)>1
-- no duplicates
select count(*) from mytable --407,421
On Synapse Serverless
CREATE VIEW MY_TABLE_VIEW AS
SELECT *,
results.filepath(1) as [Year]
FROM
OPENROWSET(
BULK 'mytable/Year=*/*.parquet',
DATA_SOURCE = 'DeltaLakeStorage',
FORMAT = 'PARQUET'
)
WITH(
[param1] nvarchar(4000),
[param2] float,
[PKCOLUMNS] nvarchar(4000)
) AS [results]
GO
select PKCOLUMNS, count(*) from MY_TABLE_VIEW
group by PKCOLUMNS
having count(*)>1 --duplicates
GO
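To narrow down where the extra rows come from, a per-file breakdown over the same wildcard might help. This is only a sketch: it reuses the path, data source and column from the view above, and the aliases `file_name` and `row_count` are names I made up for illustration. One possibility I have not ruled out is that the wildcard also matches Parquet files the Delta transaction log no longer references after the update.

```sql
-- Sketch: row counts per Parquet file matched by the view's wildcard.
-- filepath(1) returns the value matched by the first wildcard (the Year),
-- filename() returns the name of the file each row was read from.
SELECT
    results.filepath(1) AS [Year],
    results.filename() AS [file_name],
    COUNT_BIG(*) AS [row_count]
FROM
OPENROWSET(
    BULK 'mytable/Year=*/*.parquet',
    DATA_SOURCE = 'DeltaLakeStorage',
    FORMAT = 'PARQUET'
)
WITH(
    [PKCOLUMNS] nvarchar(4000)
) AS [results]
GROUP BY results.filepath(1), results.filename()
ORDER BY [Year], [file_name]
```

If a partition lists more files than the current Delta version should contain, that would point at leftover files rather than at the view definition itself.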
select count(*) from MY_TABLE_VIEW --814,842
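For comparison, here is a sketch of the same count read through the Delta transaction log instead of a Parquet wildcard. The assumptions are mine: that `FORMAT = 'DELTA'` is available in the serverless pool, and that `mytable/` is the table's root folder (the one containing `_delta_log`). I have not confirmed what this returns for our data.

```sql
-- Sketch: read the table through its Delta log rather than globbing *.parquet.
-- Assumes 'mytable/' is the Delta root folder, i.e. it contains _delta_log.
SELECT COUNT_BIG(*) AS [row_count]
FROM
OPENROWSET(
    BULK 'mytable/',
    DATA_SOURCE = 'DeltaLakeStorage',
    FORMAT = 'DELTA'
) AS [results]
```

If this matches the Databricks count of 407,421, the difference would lie in which files each format reads, not in the data itself.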