38

I have a SQL Server table RealEstate with columns Id, Property, Property_Value. The table has about 5-10 million rows and can grow even more in the future. I want to insert a row only if the combination of Id, Property, Property_Value does not already exist in the table.

Example table (Id, Property, Property_Value) -

1,Rooms,5
1,Bath,2
1,Address,New York
2,Rooms,2
2,Bath,1
2,Address,Miami

Inserting 2,Address,Miami should NOT be allowed. But 2,Price,2billion is okay. I am curious to know which is the "best" way to do this, and why; the why part is most important to me. The two ways of checking are -

  1. At application level - The app should check if a row exists before it inserts a row.
  2. At database level - Set a unique constraint across all 3 columns and let the database do the checking instead of the person/app (sketched below).
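
For concreteness, I believe the database-level option would look something like this in SQL Server (the constraint name `UQ_RealEstate_Row` is just made up for the example):

    -- One constraint across all three columns rejects duplicate combinations.
    ALTER TABLE RealEstate
    ADD CONSTRAINT UQ_RealEstate_Row UNIQUE (Id, Property, Property_Value);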

Is there any scenario where one would be better than the other?

Thanks.

PS: I know there is a similar question already, but it does not answer my problem: Unique constraint vs pre checking. Also, I think that UNIQUE applies to all databases, so I don't think I should remove the mysql and oracle tags.

Steam
  • Always go for option #2. – Wagner DosAnjos Feb 19 '14 at 19:07
  • @wdosanjos - please tell me why. – Steam Feb 19 '14 at 19:09
  • The second method is better because it is the only one of the two that is guaranteed to work: there is a chance of a [race condition](http://en.wikipedia.org/wiki/Race_condition) if you use the check-before-insert method. There is a gap in time, no matter how small, between the check and the insert, and in this gap the record could be inserted by another thread. I think the only way to avoid this is to use `MERGE` along with `HOLDLOCK` (specific to SQL Server). Even then, there is no reason not to have the constraint. You could always do both? – GarethD Feb 19 '14 at 19:15
  • @GarethD - Thanks. I got another link which explains race condition, but it does not mention how it can be a problem in my case - http://www.celticwolf.com/blog/2010/04/27/what-is-a-race-condition/ – Steam Feb 19 '14 at 19:19
  • @GarethD - Race condition is a concern only when more than one app will access/modify the database. In my case, there could be one more, but I am not sure. Can you put your comment as an answer, perhaps with some more explanation or code ? Thanks. – Steam Feb 19 '14 at 19:22
  • There is not a lot more to say; if you want an answer that is not specific to a DBMS, then I cannot add more than I have already said. Use constraints no matter how you do the insert; then you can **never** violate the integrity of the table. If you want more information specifically about avoiding the race condition in SQL Server (2008+), read [this article](http://weblogs.sqlteam.com/dang/archive/2009/01/31/UPSERT-Race-Condition-With-MERGE.aspx) about using MERGE with HOLDLOCK. – GarethD Feb 19 '14 at 19:51
  • You simply can't do it at the application level with concurrent transactions. Two concurrent transactions can still insert duplicate values if that is not prevented at the database level. The up-front check in the application isn't any cheaper than the check the database does for a UNIQUE constraint - but it won't be correct in a multi-user environment. –  Mar 04 '14 at 19:53
  • As the link @Steam posted has rotted, here is a question on Stack Overflow that [explains Race Conditions](http://stackoverflow.com/questions/34510/what-is-a-race-condition) – Toni Leigh Jan 25 '16 at 13:25

4 Answers

21

I think in most cases the differences between the two are going to be small enough that the choice should mostly be driven by picking the implementation that ends up being most understandable to someone looking at the code for the first time.

However, I think exception handling has a few small advantages:

  • Exception handling avoids a potential race condition. The 'check, then insert' method can fail if another process inserts a record between your check and your insert. So even if you're doing 'check then insert', you still want exception handling on the insert; and if you're already doing exception handling anyway, you might as well do away with the initial check.

  • If your code is not a stored procedure and has to interact with the database over the network (i.e. the application and the db are not on the same box), then you want to avoid making two separate network calls (one for the check and one for the insert); doing it via exception handling gives you a straightforward way of handling the whole thing with a single network call. There are plenty of ways to do the 'check then insert' method while still avoiding the second network call, but simply catching the exception is likely to be the simplest (see the sketch below).
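
On SQL Server, for instance, that single-call 'insert and handle the duplicate' pattern might look like the sketch below. This is only an illustration: it assumes the unique constraint from the question is in place, and it relies on error 2627 (unique constraint violation), error 2601 (unique index violation), and `THROW` (SQL Server 2012+):

    BEGIN TRY
        INSERT INTO RealEstate (Id, Property, Property_Value)
        VALUES (2, 'Price', '2billion');
    END TRY
    BEGIN CATCH
        -- 2627 = unique constraint violation, 2601 = unique index violation
        IF ERROR_NUMBER() IN (2627, 2601)
            PRINT 'Row already exists; nothing was inserted.';
        ELSE
            THROW;  -- re-raise anything unexpected
    END CATCH;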

On the other hand, exception handling requires a unique constraint (which is really a unique index), which comes with a performance tradeoff:

  • Creating a unique constraint will be slow on very large tables and it will cause a performance hit on every single insert to that table. On truly large databases you also have to budget for the extra disk space consumed by the unique index used to enforce the constraint.
  • On the other hand, it might make selecting from the table faster if your queries can take advantage of that index.

I'd also note that if you're in a situation where what you actually want to do is 'update else insert' (i.e. if a record with the unique value already exists, you update that record, else you insert a new one), then what you want is your particular database's UPSERT method, if it has one. For SQL Server and Oracle, this would be a MERGE statement.
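
As a rough sketch of such an upsert in SQL Server, here is `MERGE` combined with the `HOLDLOCK` hint mentioned in the comments above (to close the race window). It assumes, purely for illustration, that (Id, Property) identifies the row to update:

    MERGE RealEstate WITH (HOLDLOCK) AS target
    USING (VALUES (2, 'Price', '2billion'))
        AS source (Id, Property, Property_Value)
    ON target.Id = source.Id AND target.Property = source.Property
    WHEN MATCHED THEN
        UPDATE SET Property_Value = source.Property_Value
    WHEN NOT MATCHED THEN
        INSERT (Id, Property, Property_Value)
        VALUES (source.Id, source.Property, source.Property_Value);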

ivanatpr
  • Thanks. I am not sure if the performance penalty will be the same for UNIQUE and INSERT with "pre-check". Is it going to be different? If yes, how different? – Steam Feb 20 '14 at 00:24
  • In one particular case (clustered SQL Server using a SAN for storage) my experience says the check-then-insert performance might be significantly worse. The multiple hops seem to really affect the overall timing in repetitive calls. We have a chatty application and wall-time run length is significantly worse in some parts of our app for a couple of customers who use that configuration. (copper vs fiber, slow or poorly-configured NIC, who knows) – DaveE Feb 25 '14 at 15:25
  • I don't think the differences are "small". The difference is huge: you simply can't guarantee uniqueness from the application unless you take a complete exclusive *write* lock on the table, which means the in-application solution is either non-scalable or simply not working. Neither is really a good option. –  Mar 04 '14 at 19:55
  • @DaveE We have app servers that are separate from db servers (also on SAN and very well equipped to deal with multiple concurrent db requests) and my experience mirrors your own. If you're dealing with a slow app that's doing multiple separate db queries for each page load, then the very first thing you should try is changing that up so that it executes everything either in parallel or as a single jdbc batch. The individual overhead of the network hops when you execute everything serially really starts to add up. – ivanatpr Mar 04 '14 at 19:56
  • A unique column gives the error "duplicate entry on ..."; how can this be avoided without a pre-check? – Hector Dec 15 '15 at 08:31
7

Provided the cost of #1 (doing a lookup) is reasonable, I would do both. At least in Oracle, which is the database I have the most experience with.

Rationale:

  • Unique/primary keys should be a core part of your data model design; I can't see any reason not to implement them. If you have so much data that performance suffers from maintaining the unique index:
    • that's a lot of data
    • partition it or archive it away from your OLTP work
  • The more constraints you have, the safer your data is against application logic errors.
  • If you check that a row exists first, you can easily extract other information from that row to use in an error message, or otherwise fork the application logic to cope with the duplication (see the sketch after this list).
  • In Oracle, rolling back DML statements is relatively expensive because Oracle expects to succeed (i.e. COMMIT changes that have been written) by default.
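
A sketch of the 'do both' approach, in SQL Server syntax for consistency with the question (the pre-check feeds the application logic; the unique constraint stays in place as the safety net against the race condition):

    IF NOT EXISTS (SELECT 1
                   FROM RealEstate
                   WHERE Id = 2 AND Property = 'Price' AND Property_Value = '2billion')
    BEGIN
        INSERT INTO RealEstate (Id, Property, Property_Value)
        VALUES (2, 'Price', '2billion');
    END
    ELSE
    BEGIN
        PRINT 'Duplicate row; fork the application logic here.';
    END
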
Ben
  • "Dependent on the cost of #1 (doing a lookup) being reasonable" - How do I decide if the cost is reasonable ? My table has 5 -10 million rows and is growing at about 10K per month or so. I am not sure if I should be doing both pre-insert check and unique constraint as well given the huge number of rows. – Steam Feb 20 '14 at 00:27
  • There are two methods in Oracle; I'm sure it's similar in other databases. 1. Try it and see how long it takes. 2. Look at the query plan to see how it will be executed. If you have a unique index, the lookup will almost certainly use it; I'd normally expect that to be efficient. – Ben Feb 20 '14 at 07:04
  • The cost of both ways of checking will likely be similar, as it'll be doing the same thing fundamentally. – Ben Feb 20 '14 at 07:07
  • Thanks Ben. Is there a way to compare the costs of the two? In case we have missed something, or the RDBMS has a quirk or bug, the cost of the two methods might turn out to be different. – Steam Feb 21 '14 at 20:44
  • I think the only real way to compare is to run a benchmark and time the results. It's possible to estimate the cost of a SELECT statement - Oracle provides the Explain Plan feature for this purpose, but I'm not sure if it'll incorporate the cost of checking a unique index. It probably does. – Ben Feb 22 '14 at 12:29
  • SQL Server's version of Explain Plan does include the relative cost of the index lookup and will specifically let you know if a table scan is involved. – DaveE Feb 25 '14 at 15:16
  • Thanks for mentioning the cost of rolling back when it is expected to succeed. – Andrew Cotton Feb 28 '18 at 20:22
4

This does not answer the question directly, but I thought it might be helpful to post it here, since it's better than the Wikipedia entry and the link might go dead someday.

Link - http://www.celticwolf.com/blog/2010/04/27/what-is-a-race-condition/

Wikipedia has a good description of a race condition, but it’s hard to follow if you don’t understand the basics of programming. I’m going to try to explain it in less technical terms, using the example of generating an identifier as described above. I’ll also use analogies to human activities to try to convey the ideas.

A race condition is when two or more programs (or independent parts of a single program) all try to acquire some resource at the same time, resulting in an incorrect answer or conflict. This resource can be information, like the next available appointment time, or it can be exclusive access to something, like a spreadsheet. If you’ve ever used Microsoft Excel to edit a document on a shared drive, you’ve probably had the experience of being told by Excel that someone else was already editing the spreadsheet. This error message is Excel’s way of handling the potential race condition gracefully and preventing errors.

A common task for programs is to identify the next available value of some sort and then assign it. This technique is used for invoice numbers, student IDs, etc. It’s an old problem that has been solved before. One of the most common solutions is to allow the database that is storing the data to generate the number. There are other solutions, and they all have their strengths and weaknesses.

Unfortunately, programmers who are ignorant of this area or simply bad at programming frequently try to roll their own. The smart ones discover quickly that it’s a much more complex problem than it seems and look for existing solutions. The bad ones never see the problem or, once they do, insist on making their unworkable solution ever more complex without fixing the error. Let’s take the example of a student ID. The neophyte programmer says “to know what the next student number should be, we’ll just get the last student number and increment it.” Here’s what happens under the hood:

  1. Betty, an admin. assistant in the admissions office fires up the student management program. Note that this is really just a copy of the program that runs on her PC. It talks to the database server over the school’s network, but has no way to talk to other copies of the program running on other PCs.
  2. Betty creates a new student record for Bob Smith, entering all of the information.
  3. While Betty is doing her data entry, George, another admin. assistant, fires up the student management program on his PC and begins creating a record for Gina Verde.
  4. George is a faster typist, so he finishes at the same time as Betty. They both hit the “Save” button at the same time.
  5. Betty’s program connects to the database server and gets the highest student number in use, 5012.
  6. George’s program, at the same time, gets the same answer to the same question.
  7. Both programs decide that the new student ID for the record that they’re saving should be 5013. They add that information to the record and then save it in the database.
  8. Now Bob Smith (Betty’s student) and Gina Verde (George’s student) have the same student ID.

This student ID will be attached to all sorts of other records, from grades to meal cards for the dining hall. Eventually this problem will come to light and someone will have to spend a lot of time assigning one of them a new ID and sorting out the mixed-up records.

When I describe this problem to people, the usual reaction is “But how often will that happen in practice? Never, right?”. Wrong. First, when data entry is being done by your staff, it’s generally done during a relatively small period of time by everyone. This increases the chances of an overlap. If the application in question is a web application open to the general public, the chances of two people hitting the “Save” button at the same time are even higher. I saw this in a production system recently. It was a web application in public beta. The usage rate was quite low, with only a few people signing up every day. Nevertheless, six pairs of people managed to get identical IDs over the space of a few months. In case you’re wondering, no, neither I nor anyone from my team wrote that code. We were quite surprised, however, at how many times that problem occurred. In hindsight, we shouldn’t have been. It’s really a simple application of Murphy’s Law.

How can this problem be avoided? The easiest way is to use an existing solution to the problem that has been well tested. All of the major databases (MS SQL Server, Oracle, MySQL, PostgreSQL, etc.) have a way to increment numbers without creating duplicates. MS SQL Server calls it an “identity” column, while MySQL calls it an AUTO_INCREMENT column, but the function is the same. Whenever you insert a new record, a new identifier is automatically created and is guaranteed to be unique. This would change the above scenario as follows:

  1. Betty, an admin. assistant in the admissions office fires up the student management program. Note that this is really just a copy of the program that runs on her PC. It talks to the database server over the school’s network, but has no way to talk to other copies of the program running on other PCs.
  2. Betty creates a new student record for Bob Smith, entering all of the information.
  3. While Betty is doing her data entry, George, another admin. assistant, fires up the student management program on his PC and begins creating a record for Gina Verde.
  4. George is a faster typist, so he finishes at the same time as Betty. They both hit the “Save” button at the same time.
  5. Betty’s program connects to the database server and hands it the record to be saved.
  6. George’s program, at the same time, hands over the other record to be saved.
  7. The database server puts both records into a queue and saves them one at a time, assigning the next available number to them.
  8. Now Bob Smith (Betty’s student) gets ID 5013 and Gina Verde (George’s student) gets id 5014.

With this solution, there is no problem with duplication. The code that does this for each database server has been tested repeatedly over the years, both by the manufacturer and by users. Millions of applications around the world rely on it and continue to stress test it every day. Can anyone say the same about their homegrown solution?
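
For example, in SQL Server the identity approach described above might be declared like this (the table and column names are invented for the illustration):

    -- The database hands out IDs atomically, so there is no read-then-increment race.
    CREATE TABLE Student (
        StudentId INT IDENTITY(5013, 1) PRIMARY KEY,  -- seed 5013, increment 1
        Name      VARCHAR(100) NOT NULL
    );

    INSERT INTO Student (Name) VALUES ('Bob Smith');   -- receives 5013
    INSERT INTO Student (Name) VALUES ('Gina Verde');  -- receives 5014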

There is at least one well-tested way to create identifiers in the software rather than in the database: UUIDs (Universally Unique Identifiers). However, a UUID takes the form xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, where “x” stands for a hexadecimal digit (0-9 and a-f). Do you want to use that for an invoice number, student ID or some other identifier seen by the public? Probably not.

To summarize, a race condition occurs when two programs, or two independent parts of a program, attempt to access some information or access a resource at the same time, resulting in an error, be it an incorrect calculation, a duplicated identifier or conflicting access to a resource. There are many more types of race conditions than I’ve presented here and they affect many other areas of software and hardware.

Steam
0

The description of your problem is exactly why primary keys can be compound, i.e. consist of multiple fields. That way the database handles the uniqueness for you, and you don't need to care about it.

In your case, the table definition could be something like the following:

 CREATE TABLE `real_estate` (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `property` varchar(255) DEFAULT NULL,
   `property_value` varchar(255) DEFAULT NULL,
   PRIMARY KEY (`id`),
   -- composite unique key: rejects duplicate (id, property, property_value) rows
   UNIQUE KEY `index_id_property_property_value` (`id`, `property`, `property_value`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
mjuarez
  • Your unique key is fairly pointless: `id` is by definition unique since it is the primary key, and since `id` is unique, any combination of `id`, `property`, `property_value` will be unique too. – GarethD Feb 19 '14 at 19:42