0

Rationale

I created an ETL in Python that downloads some data from Salesforce and loads them in SQL Server.
Those records are related to Salesforce Objects like Case, EmailMessage, RecordType, User, etc...
Those records also have a specific ID (varchar(18)) that comes from Salesforce and it's used in the DB to JOIN (for example EmailMessage to Case) and it's also currently used as a PK.

Question

Will it make more sense, from the perspective of performance and best practices, to change the PK to an IDENTITY column and just indexing the "salesforce id" columns and treat them just for JOINing?

Thom A
  • 88,727
  • 11
  • 45
  • 75
tidus4400
  • 56
  • 1
  • 6
  • After the data is loaded, will your app start adding more "live" data to the database, or will the loaded data be refreshed periodically, if ever (and your app doesn't add live data)? – The Impaler Oct 15 '21 at 14:59
  • @TheImpaler the data are refreshed periodically (let's say 6 times a day at fixed times) and are meant to be a 1:1 copy of Salesforce's data (to make reports and similar). No data is ever inserted live by users (as they use the real Salesforce to do that) – tidus4400 Oct 15 '21 at 15:03
  • 1
    I flagged this question as a duplicate (of https://stackoverflow.com/questions/590442/deciding-between-an-artificial-primary-key-and-a-natural-key-for-a-products-table) - in my own experience, and as per the answers there, you will never regret adding a new 'artificial' primary key, and you will often eventually regret using existing data as a primary key! – MikeBeaton Oct 15 '21 at 15:06
  • If we are talking about mirrored data the general rule is that you don't really need an extra PK; there are exceptions, though, but it doesn't seem to be your case. As a side note, if you are performing high performance queries on it, I would strongly suggest you prepare indexes according to your tailored queries; the ETL tool [most likely] won't automatically add them. – The Impaler Oct 15 '21 at 15:07
  • The exception to my rule is "bad unreliable data" and that falls into the category that @MikeBeaton is considering. If you don't trust the data quality, it doesn't hurt to have an extra PK, just in case. – The Impaler Oct 15 '21 at 15:09
  • @MikeBeaton I saw that post, thanks for confirming the validity of that answer then :) – tidus4400 Oct 15 '21 at 15:09
  • @TheImpaler sure thing about indexes, thanks for the heads up! – tidus4400 Oct 15 '21 at 15:15
  • For what it's worth, we just take the data from SFDC to SQL Server without a technical PK and just use SFDC's ID, with no issues. Adding a technical PK is unnecessary overhead. – HardCode Oct 15 '21 at 15:29
  • @MikeBeaton the linked SO question's answers speak more about natural PKs. In the case of SFDC, they aren't natural keys, rather, technical keys albeit in an alphanumeric format. There's no risk of clash for an SFDC object ID, else all of SFDC would have a major issue. – HardCode Oct 15 '21 at 15:33
  • @HardCode that's exactly what I am doing in my ETLs since 2018 (just that I use Python or c# instead of the CLI) but recently the topic arised so I wanted to hear some unbiased opinions. Thanks a lot for this! – tidus4400 Oct 15 '21 at 15:53

1 Answers1

0

You can set a PK for the same column than Salesforce in your own table. But when you insert your data, use SET IDENTITY_INSERT ON to write into the PK.

Hazrelle
  • 758
  • 5
  • 9