
At the risk of over-explaining my question, I'm going to err on the side of too much information.

I am creating a bulk upload process that inserts data into two tables. The two tables look roughly as follows; the first, Parts, is a self-referencing table that allows N levels of reference.

Parts (self-referencing table)
--------
PartId (PK Int Non-Auto-Incrementing)
DescriptionId (Fk)
ParentPartId
HierarchyNode (HierarchyId)
SourcePartId (VARCHAR(500), a unique Part Id from the source)
(other columns)

Description
--------
DescriptionId (PK Int Non-Auto-Incrementing)
Language (PK either 'EN' or 'JA')
DescriptionText (varchar(max))
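
Roughly, in T-SQL the two tables look like this (a sketch; the unlisted columns and the exact types, e.g. CHAR(2) for Language, are guesses):

    -- Rough DDL for the two tables described above (other columns omitted).
    CREATE TABLE dbo.Parts
    (
        PartId        INT          NOT NULL PRIMARY KEY,  -- not auto-incrementing
        DescriptionId INT          NOT NULL,              -- FK to Description
        ParentPartId  INT          NULL,                  -- self-reference to Parts.PartId
        HierarchyNode HIERARCHYID  NOT NULL,
        SourcePartId  VARCHAR(500) NOT NULL               -- unique Part Id from the source
        -- (other columns)
    );

    CREATE TABLE dbo.Description
    (
        DescriptionId   INT          NOT NULL,
        [Language]      CHAR(2)      NOT NULL,            -- 'EN' or 'JA'
        DescriptionText VARCHAR(MAX) NOT NULL,
        CONSTRAINT PK_Description PRIMARY KEY (DescriptionId, [Language])
    );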

(I should note too that there are other tables that will reference our PartID that I'm leaving out of this for now.)

In Description, the combination of DescriptionText and Language will be unique, but any given DescriptionId will always appear in at least two rows (once per language).

Now, for the bulk upload process, I created two staging tables that look a lot like Parts and Description but don't have any PKs, indexes, etc. They are Parts_Staging and Description_Staging.

In Parts_Staging there is an extra column that contains a Hierarchy Node String, which is the HierarchyNode in this kind of format: /1/2/3/ etc. Then when data is copied from the _Staging table to the actual table, I use a CAST(Source.Column AS hierarchyid).

Because of the complexity of the IDs shared across the two tables, the self-referencing IDs and the hierarchyid in Parts, and the number of rows to be inserted (possibly in the 100,000s), I decided to compile 100% of the data in a C# model first, including the PK IDs. So the process looks like this in C#:

  1. Query the two tables for MAX ID (sketched just after this list)
  2. Using those max IDs, compile a complete model of all the data for both tables (including the hierarchyid /1/2/3/)
  3. Do a bulk insert into both _Staging tables
  4. Trigger an SP that copies non-duplicate data from the two _Staging tables into the actual tables. (This is where the CAST(Source.Column AS hierarchyid) happens.)
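
Step 1 is nothing fancy; it is just the two MAX queries (shown here only to make the later concurrency problem concrete):

    -- Step 1: grab the current MAX ids; the C# model builder starts numbering from here.
    SELECT MAX(PartId)        AS MaxPartId        FROM dbo.Parts;
    SELECT MAX(DescriptionId) AS MaxDescriptionId FROM dbo.Description;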

We are importing lots of parts books, and a single part may be replicated across multiple books. We need to remove the duplicates. In step 4, duplicates are weeded out by checking SourcePartId against the Parts table and DescriptionText against the Description table.
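
To make step 4 concrete, the copy looks roughly like this (a sketch, not the actual SP; I'm calling the staging column that holds the hierarchy string HierarchyNodeString here):

    -- Sketch of step 4: copy non-duplicate rows from staging into the final tables.
    INSERT INTO dbo.Parts (PartId, DescriptionId, ParentPartId, HierarchyNode, SourcePartId)
    SELECT s.PartId,
           s.DescriptionId,
           s.ParentPartId,
           CAST(s.HierarchyNodeString AS hierarchyid),    -- '/1/2/3/' string -> hierarchyid
           s.SourcePartId
    FROM dbo.Parts_Staging AS s
    WHERE NOT EXISTS (SELECT 1
                      FROM dbo.Parts AS p
                      WHERE p.SourcePartId = s.SourcePartId);  -- duplicate check on the source id

    INSERT INTO dbo.Description (DescriptionId, [Language], DescriptionText)
    SELECT s.DescriptionId, s.[Language], s.DescriptionText
    FROM dbo.Description_Staging AS s
    WHERE NOT EXISTS (SELECT 1
                      FROM dbo.Description AS d
                      WHERE d.DescriptionText = s.DescriptionText
                        AND d.[Language]      = s.[Language]);  -- duplicate check on the text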

That entire process works beautifully! And best of all, it's really fast. But if you are reading this carefully (and I thank you if you are) then you have already noticed one glaring, obvious problem.

If multiple processes are happening at the same time (and that absolutely WILL happen!) then there is a very real risk of the IDs getting mixed up and the data becoming badly corrupted. Process1 could run the GET MAX ID query and, before it manages to finish, Process2 could run the same GET MAX ID query; because Process1 hasn't actually written to the tables yet, Process2 would get the same IDs.

My original thought was to use a SEQUENCE object. At first, that plan seemed brilliant. But it fell apart in testing because it's entirely possible that the same data will be processed more than once and eventually ignored when the copy happens from the _Staging tables to the final tables. In that case, the SEQUENCE numbers will already have been claimed and used, resulting in giant gaps in the IDs. Not that this is a fatal flaw, but it's an issue we would rather avoid.
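
For context, the SEQUENCE experiment looked roughly like this (the sequence name is just a placeholder); the block of values reserved up front is what leaves gaps when staged rows are later discarded as duplicates:

    -- Sketch: reserve a block of ids from a sequence before building the C# model.
    -- Reserved values are never handed back, so discarded duplicate rows leave gaps.
    CREATE SEQUENCE dbo.PartIdSequence AS INT START WITH 1 INCREMENT BY 1;

    DECLARE @first SQL_VARIANT;
    EXEC sys.sp_sequence_get_range
         @sequence_name     = N'dbo.PartIdSequence',
         @range_size        = 100000,              -- one id per staged row
         @range_first_value = @first OUTPUT;

    SELECT CAST(@first AS INT) AS FirstReservedPartId;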

So... that was a LOT of background info to ask this actual question. What I'm thinking of doing is this:

  1. Lock both of the tables in question
  2. Steps 1-4 as outlined above
  3. Unlock both of the tables.

The lock would need to block reads (which I think means an exclusive lock?) so that if another process attempts the GET MAX ID query, it has to wait.

My question is: 1) Is this the best approach? And 2) How does one place an Exclusive lock on a table?

Thanks!

Casey Crookston
  • The process doesn't work well at all. Moving data processing to the client *adds* delays, it doesn't reduce them. Using MAX(ID) to calculate IDs introduces the risk of duplicate IDs, not only if there are multiple processes but also if some rows get deleted. In that case there won't be any errors but some new values will receive the IDs that were used by *deleted* entries and get related with the wrong records – Panagiotis Kanavos Mar 07 '19 at 15:41
  • 2
  • Any time you think you need to lock tables you should consider whether the entire process is broken too. Locks are meant for short-lived operations. They aren't a checkin/checkout mechanism. – Panagiotis Kanavos Mar 07 '19 at 15:42
  • Ok thanks Panagiotis. I hear you. Can I ask how you would do this entire process? – Casey Crookston Mar 07 '19 at 15:43
  • In this particular case you can add primary keys to the staging tables that get their values from a SEQUENCE object through a DEFAULT constraint (see the sketch after these comments). This way there won't be any risk of duplicates and no need for locking. Bulk inserts will assign unique, incrementing values just as if they were added through an IDENTITY function. The difference is that IDENTITY can work with only one table at a time while SEQUENCE can be used by multiple tables, e.g. if you insert multiple types of similar records that need a common key once they reach their final destination – Panagiotis Kanavos Mar 07 '19 at 15:46
  • Having both Id columns as non-auto-increment means you should be able to calculate them by some business logic. If they are only there to provide a numbering mechanism for the rows in the table, you better use the built-in identity column. The no-gaps policy can be applied to calculated columns, but not to identity / sequence generated / business-meaningless-numbers columns. – Zohar Peled Mar 07 '19 at 15:46
  • You haven't explained how the rows are related or how duplicates, in the business sense, are identified. There *has* to be a business key in the staging tables which you use right now to find what is related to what. What is it? – Panagiotis Kanavos Mar 07 '19 at 15:49
  • @PanagiotisKanavos, what I don't understand about the process you are describing is how to ensure that the IDs copy from the staging tables to the final tables. Keep in mind that many of the rows inserted into staging tables may end up being duplicate rows and then ignored when the copy happens. And once the copy is done, we will DELETE all rows in the staging tables to keep their size down – Casey Crookston Mar 07 '19 at 15:49
  • @CaseyCrookston how would you detect the duplicates? Using *which* business key? How would you detect which row in TableA is related to which row in TableB ? – Panagiotis Kanavos Mar 07 '19 at 15:50
  • @PanagiotisKanavos good question. All of this data is being imported from a third party, and the data in both columns has a unique id. I left that out because I didn't think it was relevant to the overall process. Each table has a ProviderId that is unique to the source data. But we are not using that as our PK because we are building different kinds of relationships than what was in the source data. But we use those ProviderIds to weed out duplicates. – Casey Crookston Mar 07 '19 at 15:52
  • Table A is full of truck and auto parts. Table B has descriptions either in English or Japanese. In the source data, which is a CSV, the parts and the descriptions are all 1:1 and many of the descriptions are duplicated. So for TableA we use the PartID to check for duplicates, and in TableB we use the string that is the description. The catch is that lots of the parts are duplicated across the books that we are importing. As we import multiple books, we only want to catalog a part once. – Casey Crookston Mar 07 '19 at 15:56
  • I edited my OP to be more descriptive and to include the columns I left out – Casey Crookston Mar 07 '19 at 16:04
  • @ZoharPeled, you have hit upon the core of our conundrum. How do we compile all of the related IDs and the HierarchyId before we know what the IDs will be? And if the IDs don't exist until the moment we insert the data, then how do we verify the integrity across two tables? – Casey Crookston Mar 07 '19 at 16:13
  • 1
  • The short version: This is a classic case of inserting multiple parents and multiple children in the same transaction. The long version: Read [my answer to a similar case](https://stackoverflow.com/questions/38213008/t-sql-insert-data-into-parent-and-child-tables?noredirect=1&lq=1), draw your similarities, change the proposed solution to fit your exact details, and go make yourself a cup of well earned coffee. – Zohar Peled Mar 07 '19 at 16:21
  • @ZoharPeled, I've read and re-read your answer and I have a good grasp of it all. However, I am still unclear on how to apply it to my instance because of the self-referencing table. I simply am not seeing how to use what you have provided to do a bulk insert into a self-referencing table, UNLESS I first create the entire data model in C# and then do the bulk-insert. But if I go that route, I need a bullet-proof way to get the ID to start from. – Casey Crookston Mar 07 '19 at 16:58
  • @CaseyCrookston Yes, the self-referencing table does appear to pose a problem I failed to recognize at first, but when you think about it again, you can apply the same principle as with two different tables with a parent-child reference - only now instead of a single insert, you have to break your operation to an insert to get the mapping between temp ids and final ids, and then an update. Assuming the `ParentPartId` is nullable you can insert all records with null on this column at first and then update to correct ids using the map. – Zohar Peled Mar 07 '19 at 17:46
  • Thanks @ZoharPeled. As this is a separate issue, I've created a new thread: https://stackoverflow.com/questions/55049925/bulk-insert-into-a-self-referencing-table – Casey Crookston Mar 07 '19 at 17:56
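
A minimal sketch of the SEQUENCE-plus-DEFAULT approach Panagiotis Kanavos describes above, reusing the hypothetical dbo.PartIdSequence from the earlier sketch (the constraint name is also made up):

    -- Let a SEQUENCE hand out PartId values through a DEFAULT constraint on the
    -- staging table, so concurrent bulk loads can never collide on ids.
    ALTER TABLE dbo.Parts_Staging
        ADD CONSTRAINT DF_PartsStaging_PartId
        DEFAULT (NEXT VALUE FOR dbo.PartIdSequence) FOR PartId;

    -- Rows bulk-inserted without an explicit PartId pick up the next sequence value,
    -- much as IDENTITY would; a second sequence (or the same one) could back
    -- Description_Staging.DescriptionId in the same way.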

1 Answer


I'm not sure about the best overall approach, but in terms of placing an exclusive lock on a table, simply using WITH (TABLOCKX) in your query will put one on the table.
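
For example, something along these lines would hold both tables exclusively for the duration of the transaction (a sketch based on the MAX queries in the question, not tested against your schema):

    -- Take an exclusive table lock while reading the MAX ids; HOLDLOCK keeps the
    -- locks until the transaction ends, so a second process blocks on its own
    -- MAX query until this load commits.
    BEGIN TRANSACTION;

    SELECT MAX(PartId)        FROM dbo.Parts       WITH (TABLOCKX, HOLDLOCK);
    SELECT MAX(DescriptionId) FROM dbo.Description WITH (TABLOCKX, HOLDLOCK);

    -- ... bulk insert into the staging tables and copy into the final tables here ...

    COMMIT TRANSACTION;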

If you wish to learn more about it:

https://msdn.microsoft.com/en-GB/library/ms187373.aspx

K.Madden