3

This question is part of a much more complex problem that I am breaking down into smaller chunks (for my sanity).

Let's say I have a Parts table that is self-referencing and looks something like this:

PartId  ParentPartId  Description             PartNumber   IsCatHeader     ProviderId
---------------------------------------------------------------------------------------
9292       null       'Engine Parts'           null           1      'Engine Parts||1'
9293       9292       'Engine Bolts'           null           1      'Engine Bolts||1'
9294       9293       '6mm Engine Bolt'        'X1-234-ABC'     0      '6mm Engine Bolt|X1-234-ABC|0'
9295       9293       '5mm Engine Bolt'        'X2-934-BCD'     0      '5mm Engine Bolt|X2-934-BCD|0'
9296       9295       '5mm Engine Bolt Washer' 'X2-934-GED'     0      '5mm Engine Bolt Washer|X2-934-GED|0'

You get the idea. Now... we are importing entire books (massive CSV files) of these parts, to the tune of several hundred line items in a single book.

Parts are often duplicated across books, and part of our job is to keep duplicates out of the database.

The source does not provide any kind of unique id for these parts, so we have created a ProviderId column that is a collection of data parts from each record that creates a unique string. We can then use this to check for duplicates as we do the import. (The actual data in this column is more complex than what I've shown here.)

So, now to my problem. I am trying to figure out the best way to do this in bulk. One option (NOT a good one) is to cycle through each item one at a time from the C# application.... insert a parent, get the SCOPE IDENTITY, insert all children, etc. Yuck. In a large book, this would result in thousands of DB calls per book. Not an option.

We need a bulk insert solution. But we have a real conundrum with the self-referencing aspect of this.

Our original thought was to build the ENTIRE data model in C#, including all of the PartId's and ParentPartId's. Then bulk insert directly into the Parts table. The problem with this, however, is knowing what ID to start with. Keep in mind, multiple processes will be running simultaneously, and many of the parts will be duplicates. We tried using a SEQUENCE object, but that presented problems... it's 100% possible that duplicate books will be processed, which would result in giant gaps in the ID's if we use a SEQUENCE.

The course I am chasing now is this... We have created a Parts_Staging table that looks almost like the actual Parts table. And we can do a bulk insert to it, no problem. Then it's a simple query to use the ProviderId column to look for records in Parts_Staging that don't exist in Parts and move them over.

But with this path, I am not creative (or experienced) enough to imagine a way to do this move / merge and to keep the self-referencing id's intact.

I've been reading threads like "How to Insert data into self reference table in sql server?" and "T-SQL - Insert Data into Parent and Child Tables" but so far I am still not seeing the vision.

Casey Crookston
  • 13,016
  • 24
  • 107
  • 193
  • Is it really "an issue" if there are gaps in IDs? (If so, how are deletes handled?) Having a dense-packing requirement usually means there is some .. odd coupling. – user2864740 Mar 07 '19 at 17:55
  • Anyway, since already having the staging the table, this question really comes down to 'how to do a bulk merge-insert' (with atomic considerations)? – user2864740 Mar 07 '19 at 17:58
  • @user2864740, no, it's not a fatal flaw. The only drawback is that if we process several hundred thousand rows, and then claim that many numbers from the sequence, then we have massive gaps. We could make the id columns BIGINTS to accomidate, but if we can avoid this issue, we'd rather. – Casey Crookston Mar 07 '19 at 17:58
  • Several hundred thousand. Hahahh. Hahahahaha. :D OTOH, we have run out of INT-32 space .. at millions/day D: – user2864740 Mar 07 '19 at 17:59
  • 1
    @user2864740, yes good point! "this question really comes down to 'how to do a bulk merge-upsert'? – Casey Crookston Mar 07 '19 at 17:59
  • I'm lost as to why that's funny... (scratching my head) – Casey Crookston Mar 07 '19 at 18:00
  • It's such a tiny number :) That said, I'd consider opting for INT-64 (or even, uhg, GUIDs) if these hundreds of thousands are done to the tune of millions/day.. I don't know the use-case though. This tangential comment is largely because I'm currently jaded about having to deal with a crummy schema now, where exhaustion must be handled within a year :} – user2864740 Mar 07 '19 at 18:00
  • if it was only several hundred thousand once, then no problem. But when that gets repeated time and time again, it becomes a big number. And yes, running out of Int32's is a real concern. So... legit question... if we make the ID fields a BIGINT (Int64 or Long in C#), would you personally be okay with the gaps in id's? – Casey Crookston Mar 07 '19 at 18:03
  • Absolutely - just project the ID space usage / time. – user2864740 Mar 07 '19 at 18:04
  • 1
    My question is: What's the logic of determining which items gets to be parent or child of another item? – alans Mar 07 '19 at 18:04
  • @alans, that's all done ahead of time in C#. The "book" that we import has the Part (example: Q4.123-ABC-XYD) and the Parent Part. We are using that to build a List. – Casey Crookston Mar 07 '19 at 18:06
  • @user2864740, ok, thank you. Maybe the gaps in id's is not the issue I thought it was. – Casey Crookston Mar 07 '19 at 18:07
  • @CaseyCrookston Ok. So another question, is there a way to identify each batch that you process? Perhaps a unique identifier? – alans Mar 07 '19 at 18:24
  • @alans Yes! We just this morning decided to add a column, `SequeneKey` which is a GUID that will be unique per bulk upload. – Casey Crookston Mar 07 '19 at 18:30

1 Answers1

3

Ok. Here's what I would do. First, make up the sequence in your collection in C# along with it's parent/child relationship. But I would put that in a different column, maybe something like BatchPartId and BatchParentPartId. (Maybe a different associated table. It doesn't matter.)

| PartId | ParentPartId | BatchPartId | BatchParentPartId |
|--------|--------------|-------------|-------------------|
|        |              | XX1901      |                   |
|        |              | XX1902      | XX1901            |
|        |              | XX1903      | XX1901            |
|        |              | XX1904      | XX1903            |
|        |              | XX1905      | XX1903            |

Then, insert the entire list, allowing the PartId to get created.

| PartId | ParentPartId | BatchPartId | BatchParentPartId |
|--------|--------------|-------------|-------------------|
| 55     |              | XX1901      |                   |
| 56     |              | XX1902      | XX1901            |
| 57     |              | XX1903      | XX1901            |
| 58     |              | XX1904      | XX1903            |
| 59     |              | XX1905      | XX1903            |

At post, this is where you fill in the ParentPartId with the PartId of the row with the corresponding BatchPartId from BatchParentPartId of the child row.

| PartId | ParentPartId | BatchPartId | BatchParentPartId |
|--------|--------------|-------------|-------------------|
| 55     |              | XX1901      |                   |
| 56     | 55           | XX1902      | XX1901            |
| 57     | 55           | XX1903      | XX1901            |
| 58     | 57           | XX1904      | XX1903            |
| 59     | 57           | XX1905      | XX1903            |
alans
  • 1,022
  • 9
  • 17
  • ok, Thanks Alan! I'm going to prod you for a little more detail, if that's okay. 1) When you say, "First, make up the sequence in your collection in C# "... by "sequence" do you mean a SQL sequence? Or are you talking about a List? 2) What exactly does "At post" mean? Where in the process are you? And are you using a _staging table? Or going direct to the actual table? – Casey Crookston Mar 07 '19 at 18:49
  • You make up the sequence but make it unique to that batch like a combination of batch number and start with 1. – alans Mar 07 '19 at 18:51
  • ah ok! I'm tracking with you now! – Casey Crookston Mar 07 '19 at 19:12
  • So two queries. One to do the bulk insert, then another to back-fill the ParentPartId – Casey Crookston Mar 07 '19 at 19:12
  • 1
    Yep, that just what I suggested in my last comment to the [much more complex problem](https://stackoverflow.com/questions/55047552/how-to-lock-and-unlock-a-sql-server-table?noredirect=1#comment96849842_55047552) – Zohar Peled Mar 08 '19 at 08:21