I will describe my situation first in order to make the following question as clear as possible.
For simplicity, let's say I have a table in a MySQL database (InnoDB) holding records about dogs, with the following structure:
dog_id (PK) | dog_name
There are 10,000,000 rows in the table (each represents a unique dog), and there is an index built on the dog_name column.
My program goes through vet records that I need to process. Each record is connected with a dog, and there are about 100 records for each dog. I want to find the dogs which have not been inserted into the database yet.
That means that, the vast majority of the time, the record being processed is about a dog which is already in the database, so the dog doesn't have to be added. But occasionally (the 1:100 ratio mentioned above) I need to add a dog to the database, because it is the first time the program has come across a record about that dog. (I hope this example makes my situation clear.)
My question is: what is the most efficient way to verify that the dog has not been inserted into the database yet?
- Load all the dog names (suppose all the dogs in the world have unique names) into the program's memory (as a set) and check whether the dog is in the set. If it is in the set, I skip the record; if it is not, I insert the dog (and add its name to the set).
- Define the column as UNIQUE and try to insert every record. When the database raises an error because of the uniqueness constraint, I just skip the dog and continue.
- Query the database every time I process a record to find out whether the dog is already there. If it is, I skip the record; if it is not, I insert the dog into the table.
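
To make the three options concrete, here is a minimal sketch of what I mean. It is only an illustration: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL/InnoDB and SQLAlchemy so that it runs on its own, the dog_name column is declared UNIQUE (which only the second option strictly needs), and the function names and sample records are made up for this example.

```python
import sqlite3

# Stand-in for the real MySQL/InnoDB table (assumption: sqlite3 is used
# here only so the sketch is self-contained and runnable).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dogs ("
    "dog_id INTEGER PRIMARY KEY, "
    "dog_name TEXT NOT NULL UNIQUE)"
)

INSERT_SQL = "INSERT INTO dogs (dog_name) VALUES (?)"


def insert_via_set(conn, names, known):
    """Option 1: keep every known name in an in-memory set."""
    added = 0
    for name in names:
        if name not in known:
            conn.execute(INSERT_SQL, (name,))
            known.add(name)  # keep the set in sync with the table
            added += 1
    return added


def insert_via_unique(conn, names):
    """Option 2: always try to insert; the UNIQUE constraint rejects duplicates."""
    added = 0
    for name in names:
        try:
            conn.execute(INSERT_SQL, (name,))
            added += 1
        except sqlite3.IntegrityError:
            pass  # dog is already in the table, skip this record
    return added


def insert_via_query(conn, names):
    """Option 3: look the name up first, insert only when it is missing."""
    added = 0
    for name in names:
        row = conn.execute(
            "SELECT 1 FROM dogs WHERE dog_name = ?", (name,)
        ).fetchone()
        if row is None:
            conn.execute(INSERT_SQL, (name,))
            added += 1
    return added


# Hypothetical sample data: each string is the dog name found in one vet record.
records = ["Rex", "Rex", "Fido", "Rex", "Fido"]

known = {row[0] for row in conn.execute("SELECT dog_name FROM dogs")}
added_set = insert_via_set(conn, records, known)
added_unique = insert_via_unique(conn, records + ["Bella"])
added_query = insert_via_query(conn, records + ["Luna"])
total = conn.execute("SELECT COUNT(*) FROM dogs").fetchone()[0]
```

All three functions leave the table in the same state; they differ only in where the duplicate check happens (program memory, the constraint on insert, or an extra SELECT per record).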
To give you as much information as I can: I use Python, SQLAlchemy, and MySQL with InnoDB.