0

I'm trying to find and remove duplicates from a table consisting of 29 columns as follows:

[Partitioned Commodity Code] (nvarchar(max),null)
[Supplier] (nvarchar(255),null)
[Manufacturer] (nvarchar(255),null)
[Order ID] (nvarchar(55),null)
[Status] (nvarchar(25),null)
[Ordered Date] (nvarchar(105),null)
[Order Method] (nvarchar(55),null)
[Ship to Address] (nvarchar(255),null)
[Address] (nvarchar(255),null)
[City]  (nvarchar(55),null)
[State]  (nvarchar(55),null)
[Postal Code]  (nvarchar(55),null)
[Requisitioner] (nvarchar(100),null)
[Accounting Entity] (nvarchar(255),null)
[Preparer] (nvarchar(255),null)
[Item No] (float,null)
[Description] (nvarchar(max),null)
[Nominal Account] (nvarchar(255),null)
[LineItem ProjectCode] (float,null)
[Cost Center] (nvarchar(100),null)
[CostCenter Description] (nvarchar(max),null)
[Supplier Part Number] (float,null)
[Is Catalogue] (nvarchar(55),null)
[Unit Price] (float,null)
[Quantity] (float,null)
[Line Item Total] (float,null)
[Line Item Total USD] (float,null)
[Currency] (nvarchar(55),null)
[Request Cross Reference] (nvarchar(55),null)

In this dataset, there is no primary key and honestly I can't figure out what combination of keys can be treated as primary key to find and remove duplicates.

I tried the code below considering [Order ID]:

select *
from [PO_BuySmart_Main] a
join 
    (select [Order ID], count(*) as OID  
     from [PO_BuySmart_Main] 
     group by [Order ID] 
     having count(*) > 1 ) b on a.[Order ID] = b.[Order ID]
order by a.[Order ID]

I want to consider all columns to check duplicate for and delete those records accordingly.

[PO_BuySmart_Main] is the table name for which a single stored procedure will do this task.

Any help would be much appreciated, thanks.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
sneha
  • 11
  • 4
  • Does this answer your question? [How to delete duplicate rows in SQL Server?](https://stackoverflow.com/q/18390574/2029983) If not, why not? – Thom A Jan 24 '21 at 13:14
  • It's probably a denormalized table, so for example you will have multiple `OrderId` with different `LineItem ProjectCode` (which **really**s hould not be a float). So you cannot just deduplicate, you need to work out how to normalize it – Charlieface Jan 24 '21 at 13:19
  • First, I would like to find duplicate records i.e. what is the count; then once it is is known I'll go after removing them. @Larnu – sneha Jan 24 '21 at 13:20
  • Storing something like a price, in this case `[Line Item Total USD]`, is a terrible idea. `float` is not a precise data type, and so you could easily end up with problems. Also, I really suggest not using object names that require delimit identification. – Thom A Jan 24 '21 at 13:23
  • I am not a DB administrator so will not have permissions to work on tables instead; only I can write queries. @Charlieface – sneha Jan 24 '21 at 13:23
  • *"then once it is is known I'll after removing them"* So the duplicate *does* answer the question? – Thom A Jan 24 '21 at 13:24
  • Then you need to convert all this in your queries. Like I said, you probably don't have duplicate rows, just denormalized tables. @Larnu OP has things like `Item No` in `float`s which is just daft, they're not going to be decimal anyway – Charlieface Jan 24 '21 at 13:25
  • would like to write a single stored procedure which can find and delete all the duplicate records in one go considering all 29 columns in one shot from only one table [PO_BuySmart_Main] from that particular database @Larnu – sneha Jan 24 '21 at 13:26
  • I've closed this as a duplicate, as it does answer the question you are asking; I don't understand your comments as they aren't addressing the comments I make; they appear to be incomplete additions to the question, which should be an [edit]. If the duplicate doesn't answer the question, [edit] the question to explain why, and I'll happily reopen it. I do, however, suggest fixing those data types like myself and @Charlieface have suggested. That, however, is a different question. – Thom A Jan 24 '21 at 13:31
  • _I am not a DB administrator_ This is the first person you should ask for help, followed by the person/group that designed the table. – SMor Jan 24 '21 at 13:43
  • The linked duplicate question has a good methodology of accomplishing that. Another alternative you can do is use the [HASHBYTES()](https://learn.microsoft.com/en-us/sql/t-sql/functions/hashbytes-transact-sql?view=sql-server-ver15) function to generate a unique hash for the row by concatenating the values of every column in it. Then you can easily identify dupes by joining the data set to itself on the row hash. Please ensure to use a distinct enough algorithm and read on the intricacies of the function first though. – J.D. Jan 24 '21 at 14:33

0 Answers0