How to properly avoid insert duplicate rows in SQL Server data tables which don't have any PK?

Question

I'm using SqlBulkCopy class in C# to copy data from one SQL Server database to another in a fast way. The databases are in different servers and their datatables don't have any PK, so the process gets more complicated.

The problem is that the query I'm using to select data from the original database gets duplicate rows and SqlBulkCopy cannot avoid insert duplicate records in destination database.

I cannot use SELECT * because it throws an OutOfMemoryException, so I do SELECT TOP X * and load that data into a DataTable. In each Datatable I can remove the duplicate records using C#, but then when I select the next TOP X, the first row selected may be equal to the last one that was in the previous DataTable and has been already inserted into the destination database. The DataTable variable is always the same, it is reloaded!

I want to avoid duplicate records to be inserted without create PK because it's not applicable to my case. I really need to use SqlBulkCopy because a fast copy is a system requirement. Any suggestion? Thank you in advance!

Load the data into a staging table, and then move the data from the staging table to the final table, removing duplicates in the database. — Gordon Linoff, May 29 '17 at 18:02
I've thinking about it, maybe it is really the only option to this situation. I need to study more about this issue. Are stating tables the same as temporary tables? And should they be created in the same database as the final tables? — AnnaB, May 29 '17 at 18:12
It doesn't work because there are equal values in the same columns. — AnnaB, May 29 '17 at 18:41
Isn't equal values in same columns the definition of duplicate... — James Z, May 29 '17 at 19:23
Yes, I know James Z. What I was trying to say is that SELECT DISTINCT doesn't work because in my case there are equal values in the same column so if I select distinct some rows would be missing, — AnnaB, May 29 '17 at 20:07
Copy data to a view in SQL server. Then filter the view before storing data into actual database. Finally delete the view. — jdweng, May 29 '17 at 20:21
I don't think that could be a good choice because it needs to be an automatic process... — AnnaB, May 29 '17 at 20:38

Tom Bowen · Accepted Answer · 2017-05-30T10:39:58.033

0

Don't use C#.

You can right click your origin database in SSMS and choose "Tasks" and then "Generate Scripts". Choose the table you want and use the wizard to generate your insert scripts. Then run these on your second database.

If this action need to be repeated you could set up a Linked Server between your two SQL Server instances, and then write an insert statement from one to the other in a Stored Procedure. You can then run this stored procedure whenever you need, or call it from C#.

If you want it to run regularly you could set up a Job on the database.

edited May 30 '17 at 10:39

answered May 29 '17 at 21:19

Tom Bowen

8,214
4
22
42

It's not a viable solution because the original database is always being increased. I need to use a OOP language because the solution requires other stuff. Any other suggestion Tom.Bowen? Thank you for the answer! – AnnaB May 29 '17 at 21:49
Why does it have to be an OOP language? Do you just need something that can be done automatically over and over again? – Tom Bowen May 30 '17 at 05:35
@mvc I've added some more options to achieve what you need. – Tom Bowen May 30 '17 at 10:41
Tom, thank's a lot! I've read about Linked Server and I think it might be a great solution for my problem! Although, I have some doubts, the large amount of information I've read confused me. Do I have to create Linked Server in both servers? – AnnaB Jun 03 '17 at 17:57
@mvc, no you only need to create the Linked Server on the database you want to run the script/procedure on. – Tom Bowen Jun 04 '17 at 21:35
Tom, even if I use INSERT INTO x SELECT FROM y, do I just need to create the Linked Server in one database? If so, in which one should I create, in x or y? – AnnaB Jun 05 '17 at 13:50
@mvc, Whichever server you are running the query on. – Tom Bowen Jun 05 '17 at 13:55
So, if I'm running the .exe C# in y server computer should I create the Linked Server in y? Sorry, I'm a beginner. – AnnaB Jun 05 '17 at 14:58
The C# code is going to connect to a database (x) and execute some SQL code "INSERT INTO ....etc". You need to set up the Linked Server on this database (x), as it is the one that is executing the SQL code. – Tom Bowen Jun 05 '17 at 15:11
I've created a linked server on database (x). However I got the error "Could not find server 'SERV\INST_EX' in sys.servers. Verify that the correct server name was specified. If necessary, execute the stored procedure sp_addlinkedserver to add the server to sys.servers." The server that is not found is the one where I've created linked server. Maybe I must create a linked server in the other server (?) I'm using queries like: INSERT INTO 0 SELECT 1 FROM 2 WHERE NOT EXISTS (SELECT * FROM 3 WHERE 4.5 = 6.7). 4 refers to one server and 6 refers to other server. I'm stuck here. Any suggestion? – AnnaB Jun 06 '17 at 12:59

score 0 · Answer 2 · answered May 30 '17 at 10:46

Have you considered copying the rows out of the first database onto a file on disk rather than in memory? Then you will be able to get all of them in one go rather than needing to make batches with select top X *. Once the data is on disk it can be sorted -- perhaps even with an implementation of Unix sort that handles large files -- and duplicate records removed.

If you want to remove duplicates then at some point you are going to need to have all the data in one place and either sort it or make an index on it. That can be in the first database, in memory, on disk, or in the second database. There are reasons why you don't want to make indexing in either of the databases, and there isn't room for all the data in memory, so that seems to leave spooling it to disk as the only option.

Personally, though, I would think very hard about making a primary key. Although you say it's not applicable, it may be worth having it just to help with data loading.

How to properly avoid insert duplicate rows in SQL Server data tables which don't have any PK?

2 Answers2