
I will describe my situation first in order to make the following question as clear as possible.

For simplicity, let's say I have a table in a MySQL database (InnoDB) with records about dogs, structured as follows:

dog_id (PK) | dog_name

There are 10,000,000 rows in the table (each represents a unique dog) and an index built on the dog_name column.

My program searches through vet records that I need to process. Each record is connected with a dog, and there are roughly 100 records for each dog. I want to find the dogs which have not been inserted into the database yet.

That means that 100 times in a row the record being processed can be about a dog which is already in the database, so the dog doesn't have to be added. But sometimes (as mentioned before, a 1:100 ratio) I need to add a dog to the database because it is the first time the program has encountered a record about that dog. (I hope this example makes my situation clear.)

My question is: What is the most effective way to verify that the dog has not been inserted into the database yet?

  1. Load all the dog names (suppose all the dogs in the world have unique names) into the program's memory (a set) and check whether the dog is in the set. When it is in the set I skip the record; when it is not, I insert the dog.
  2. Define the column as UNIQUE and try to insert all the records. When there is a database error because of the uniqueness constraint, I just skip the dog and continue.
  3. Query the database every time I process a record to find out whether the dog is already there; if it is, I skip the record, and if it is not, I insert the dog into the table.

To give you as much information as I can: I use Python, SQLAlchemy, MySQL, and InnoDB.

Marek

2 Answers


You should use dog_name as the primary key, and then use

INSERT INTO dogs (dog_name) VALUES ('[NAME HERE]') ON DUPLICATE KEY UPDATE dog_name='[NAME HERE]';

This will only insert unique dog names. If you still want a numerical ID for each dog, you can set that column to auto-increment, but the primary key should be the dog name (assuming all names are unique).
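
For illustration, here is a minimal sketch of issuing that statement from Python through SQLAlchemy; the connection string and the upsert_dog helper name are placeholders, and only the dogs table and dog_name column come from the question:

# A sketch, assuming dog_name is the primary key (or at least UNIQUE);
# the connection string is a placeholder, not taken from the question.
from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

engine = create_engine('mysql://user:password@localhost/vets')
Session = sessionmaker(bind=engine)
s = Session()

def upsert_dog(dog_name):
    # The duplicate-key clause turns a repeated insert into a no-op update,
    # so this is safe to call for every vet record processed.
    s.execute(
        text("INSERT INTO dogs (dog_name) VALUES (:name) "
             "ON DUPLICATE KEY UPDATE dog_name = :name"),
        {'name': dog_name},
    )
    s.commit()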

SQLAlchemy does not have this functionality built in, but you can force it to issue a similar query with session.merge().
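
As a rough sketch of that merge() route (the mapped Dog class and the String(100) length are assumptions; merge() matches on the primary key, so it only deduplicates by name if dog_name is the primary key as suggested above):

from sqlalchemy import Column, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Dog(Base):
    __tablename__ = 'dogs'
    dog_name = Column(String(100), primary_key=True)

def ensure_dog(s, dog_name):
    # merge() issues a SELECT by primary key and INSERTs only if no row exists,
    # so it costs two round trips rather than one INSERT ... ON DUPLICATE KEY.
    s.merge(Dog(dog_name=dog_name))
    s.commit()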

sundance
  • I hadn't known the "ON DUPLICATE KEY" syntax before. Thanks. However, I guess "INSERT IGNORE INTO" would be more useful in my case. The question is, wouldn't SELECT + INSERT be faster, considering the 100:1 ratio (only SELECTs vs. SELECTs followed by an INSERT)? – Marek Jan 22 '14 at 20:27
  • 1
    I wouldn't recommend using `INSERT IGNORE`, as that will ignore *any* error in the query, including a duplicate key. Doing a `SELECT` and then an `INSERT` will mix some Python in with the SQL, which isn't optimized and may be slower than just doing `INSERT...ON DUPLICATE KEY`/`INSERT IGNORE` (pure SQL). So I'd recommend INSERT...ON DUPLICATE KEY. – sundance Jan 22 '14 at 21:51

Something like option 2 or option 3 will work best; they should take similar amounts of time, and which one wins will depend on exactly how MySQL/InnoDB decides that a collision has occurred. I don't actually know; it's possible that an insert against a UNIQUE key triggers the same lookup as a SELECT would. Prototype both and profile the performance.
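
As a rough prototype of option 2 (assuming `s` is a SQLAlchemy Session and dog_name carries a UNIQUE index; the insert_if_new name is made up):

from sqlalchemy import text
from sqlalchemy.exc import IntegrityError

def insert_if_new(dog_name):
    # Rely on the UNIQUE index: a duplicate raises IntegrityError,
    # which we treat as "the dog is already there" and roll back.
    try:
        s.execute(text("INSERT INTO dogs (dog_name) VALUES (:name)"),
                  {'name': dog_name})
        s.commit()
    except IntegrityError:
        s.rollback()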

If performance is an issue, you can always hand-code the SELECT statement since it's relatively simple. This cuts out the Python overhead of constructing the SQL; that's normally not a huge issue, but SQLAlchemy can add dozens of layers of function calls to support its ability to construct arbitrary queries. You can short-circuit those calls with Python string formatting.

Assuming that 's' is your SQLAlchemy Session object:

def dog_in_db(dog_name):
    # Note: the value needs quoting (and the name must be trusted) for the
    # interpolated SQL to be valid.
    q = "SELECT COUNT(*) FROM dogs WHERE dog_name = '%s';" % dog_name
    res = s.execute(q)
    return res.first()[0] > 0
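
As a hypothetical usage sketch (the records iterable and its dog_name attribute are placeholders for whatever your program iterates over):

for record in records:
    if not dog_in_db(record.dog_name):
        s.execute("INSERT INTO dogs (dog_name) VALUES ('%s');" % record.dog_name)
        s.commit()
    # ...process the vet record itself...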

You could also try SELECTing the row and checking whether any rows are returned:

def dog_in_db(dog_name):
    q = "SELECT dog_id FROM dogs WHERE dog_name = '%s';" % dog_name
    res = s.execute(q)
    # rowcount is a property on the result, not a method
    return res.rowcount > 0

Assuming that your option 1 means loading all of the names from the database, it will be slow. MySQL will always perform any single operation it supports faster than Python can; and what you're doing here is exactly the same single operation (finding a member in a list).
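
For reference, option 1 would amount to roughly this sketch (again assuming `s` is the Session; the helper name is made up). The up-front query pulls all 10,000,000 names and the set has to hold them in memory:

# Load every existing name once, then test membership in Python.
known_dogs = {row[0] for row in s.execute("SELECT dog_name FROM dogs")}

def dog_in_db_cached(dog_name):
    return dog_name in known_dogs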

Channing Moore
  • Thanks. Just two things: 1) Could you be more specific - what do you mean by "_hard-code the SELECT statement_"? 2) After further research on the idea you are suggesting, I found that the best performing `SELECT` would be `SELECT COUNT(1) FROM dogs WHERE dog_name = %s LIMIT 1`. What do you think about it? – Marek Jan 23 '14 at 20:51
  • I think I mistyped that: I meant "hand-code". Sounds like you handled it gracefully. And that sounds like a very reasonable bit of SQL. I'm guessing that the `LIMIT 1` bit speeds up the SQLAlchemy Result parsing. Sounds like `COUNT(1)` is the same as `COUNT(*)` at least for MySQL: http://stackoverflow.com/questions/1221559/count-vs-count1 – Channing Moore Jan 23 '14 at 22:18