Best way of dealing with "duplicates"

Question

I have a Stored Procedure that extracts data from a ERP into a datawarehouse (DW). The problem that I facing is that the people from the DW asked me to have an unique ID for each row which I don't have.

An Example

CREATE TABLE [dbo].[INVOICES](
    [InvID] [nchar](10) NULL,
    [LineID] [nchar](10) NULL,
    [Amount] [decimal](32, 4) NULL
) ON [PRIMARY]

CREATE TABLE [dbo].[DISTRIBUTION](
    [InvID] [nchar](10) NULL,
    [LineID] [nchar](10) NULL,
    [Percent] [decimal](3, 2) NULL
) ON [PRIMARY]

InvID      LineID     Amount
---------- ---------- ---------------------------------------
FC0000001  1          6788.0000
FC0000001  2          8908.0000

InvID      LineID     Percent
---------- ---------- ----------------------
FC0000001  1          10
FC0000001  1          90
FC0000001  2          100

So, you can query:

SELECT I.InvID + '-' + D.LineID AS ID, D.* 
FROM [INVOICES] AS I
LEFT JOIN [DISTRIBUTION] AS D 
     ON  D.InvID  = I.InvID 
     and D.LineID = I.LineID

And will get:

ID                   InvID      LineID     Percent
-------------------- ---------- ---------- ----------------------
FC0000001-1          FC0000001  1          10
FC0000001-1          FC0000001  1          90
FC0000001-2          FC0000001  2          100

But, as you see, If there is two records at [DISTRIBUTION] for the same LineID and InvID, you get two records from the query above and, the table [DISTRIBUTION] doesn't contain an ID for each line.

The Problem

So, I need to create an ID on the fly that is:

Unique (each row must be different)
Repeteable (every time I run the process, the same records must have the same ID value)
I can't have an ID at the ERP level. I can't just add and ID to the [DISTRIBUTION] table because it's a third party, poorly designed ERP.

What I've already tried

Binary Check sums are not safe and, if data change the ID changes.
ROW_NUMBER() is what i'm using but... I'm not confortable with it. Affects performance and, there is a chance that it changes.
Quiting my job is not an option :)

I have edited your title. Please see, "[Should questions include “tags” in their titles?](http://meta.stackexchange.com/questions/19190/)", where the consensus is "no, they should not". — John Saunders, Oct 05 '14 at 18:59
Which ERP is it - there may be people who have solved the issue you have already? Have you contacted their support? — Ryan Vincent, Oct 05 '14 at 19:26
Is it possible for the distribution to ever change, or for the line items to ever change? If so, I think "repeatable" is impossible unless you store the mapping between distributions and line items. — Mike Sherrill 'Cat Recall', Oct 05 '14 at 19:38
Could you please clerify this requirement - 'Repeteable (every time I run the process, the same records must have the same ID value).' What exactly defines that the record is the same? What data could be changed and record should be still counted the same? In case all columns may change you need physical reference of any record which is unsafe by its nature, I think. So, if you have unique content in `DISTRIBUTION` use binary checksum - if data changed this means that record is deleted. — Serg, Oct 06 '14 at 08:31
You say you cannot use "binary checks" to make some sort of aggregated ID, because the data might change, but it is also the data that makes you have your "duplicate" rows? If Percent changes, the row changes and then should it have the same identification? — Allan S. Hansen, Oct 06 '14 at 10:45
If the data can change and you cannot add a column then you cannot generate a unique row ID from that join that is repeatable. — paparazzo, Oct 06 '14 at 14:34

score 0 · Answer 1 · edited May 23 '17 at 12:22

0

In Oracle there is ROWID for each row in each Table, on SQL Server there is an undocumented column %%physloc%% as the same of ROWID.

So in your case you can try.

SELECT %%physloc%%, * 
FROM [DISTRIBUTION]

See very well written answer on stackoverflow for more info on %%physloc%%

edited May 23 '17 at 12:22

Community

1
1

answered Oct 06 '14 at 10:25

Max

6,821
3
43
59

Best way of dealing with "duplicates"

1 Answers1