how to avoid duplicates while inserting data to database

Question

Having issues with duplicates while inserting values from excel to sql database. in my database I have a table sanction with 15 columns. my query:

INSERT INTO sanction (
                id,
                organization_type, organization, date,
                decision_number, penalty_type, penalty_way,
                penalty, violation, execution_period,
                article, note, type_npa,
                department, uploaded_date)
                VALUES (
                null,
                   %s, %s, STR_TO_DATE(%s,"%%d.%%m.%%Y"),
                %s, %s, %s,
                %s, %s, STR_TO_DATE(%s,"%%d.%%m.%%Y"),
                %s, %s, %s,
                %s, %s)

Tried if not exist, unique didn't help. can somebody help me with query? Want to use it without creating temporary table, and uniques. thanks!

@antonio_veneroso excel is updating every day. and file combined with new and old data. (new one appends in everyday basis) — Yerbol, Feb 27 '19 at 19:59
Use either a `PRIMARY` or `UNIQUE` column or a `CONSTRAINT` of multiple columns. Relevant [table-constraint](https://sqlite.org/syntax/table-constraint.html), [what-does-keyword-constraint-do-in-this-create-table-statement](https://stackoverflow.com/questions/1179352/what-does-keyword-constraint-do-in-this-create-table-statement) — stovfl, Feb 27 '19 at 20:42
In the excel, how do you know if it's a new row? Do you have a unique field? Would be great if you have a unique field, that way you just create a unique field in your database table. When you try to insert it will fail, because the data is already in it. — Antonio Veneroso Contreras, Feb 27 '19 at 22:56

score 0 · Accepted Answer · answered Mar 11 '19 at 21:29

I'm pretty sure you will NOT know if the data in the data frame dupes some of the records in the database, at least until after you push the data set into the database. If you want to de-dupe a data-frame, you can do it like this.

df.drop_duplicates()

import pandas as pd
raw_data = {'first_name': ['Jason', 'Jason', 'Jason','Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', 'Miller', 'Miller','Ali', 'Milner', 'Cooze'], 
        'age': [42, 42, 42, 36, 24, 73], 
        'preTestScore': [4, 4, 4, 31, 2, 3],
        'postTestScore': [25, 25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
print(df)


df = df.drop_duplicates(['last_name'], keep='first')
print(df)

Or, test for dupes DURING the insert. I found a couple samples on SO, which may help you. If you are using SQL Server...

Using NOT EXISTS:

INSERT INTO TABLE_2
  (id, name)
SELECT t1.id,
       t1.name
  FROM TABLE_1 t1
 WHERE NOT EXISTS(SELECT id
                    FROM TABLE_2 t2
                   WHERE t2.id = t1.id)

Using NOT IN:

INSERT INTO TABLE_2
  (id, name)
SELECT t1.id,
       t1.name
  FROM TABLE_1 t1
 WHERE t1.id NOT IN (SELECT id
                       FROM TABLE_2)

Avoid duplicates in INSERT INTO SELECT query in SQL Server

Or, consider using VBA for this kind of thing, if you are already working with Excel.

how to avoid duplicates while inserting data to database

1 Answers1