5

Background

I'm building an application that passes data from a CSV to an MS SQL database. This database is being used as a repository for all of my enterprise's records of this type (phone calls). When I run the application, it reads the CSV and converts it to a Pandas dataframe, which I then append to my table in SQL using SQLAlchemy and pyodbc.

However, due to the nature of the content I'm working with, there is oftentimes data that we have already imported into the table. I am looking for a way to check whether my primary key (a column in both my SQL table and my dataframe) already exists before appending each record to the table.

Current code

import sqlalchemy as sql

# save dataframe (df, read from the CSV) to the MS SQL DB
engine = sql.create_engine('mssql+pyodbc://CTR-HV-DEVSQL3/MasterCallDb')
df.to_sql('Calls', engine, if_exists='append')

Sample data

My CSV is imported as a pandas dataframe (the primary key is FileName, which is always unique), then passed to MS SQL. This is my dataframe (df):

+---+------------+-------------+
|   |  FileName  |    Name     |
+---+------------+-------------+
| 1 | 123.flac   | Robert      |
| 2 | 456.flac   | Michael     |
| 3 | 789.flac   | Joesph      |
+---+------------+-------------+

Any ideas? Thanks!

rchav9
  • Has a solution been found for this? – KidSudi Jan 29 '15 at 20:45
  • I have stored procedure logic in SQL that takes care of the issue for me. Basically, I did this: [http://stackoverflow.com/q/4253960/3608301], then run it for each file I import (see the sketch below). Not terribly efficient, but it works. @KidSudi – rchav9 Jan 30 '15 at 00:34
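
For reference, a minimal sketch of the kind of per-row "insert only if the key is new" stored procedure that comment describes (this is not the code from the linked answer; the procedure name, ODBC driver string, and column list are assumptions):

import pyodbc

# Hypothetical T-SQL, created once on the server:
#   CREATE PROCEDURE dbo.usp_InsertCallIfNew @FileName NVARCHAR(255), @Name NVARCHAR(255)
#   AS
#   BEGIN
#       IF NOT EXISTS (SELECT 1 FROM Calls WHERE FileName = @FileName)
#           INSERT INTO Calls (FileName, Name) VALUES (@FileName, @Name);
#   END

conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=CTR-HV-DEVSQL3;DATABASE=MasterCallDb;Trusted_Connection=yes'
)
cursor = conn.cursor()
for _, row in df.iterrows():  # run it for each imported record
    cursor.execute("EXEC dbo.usp_InsertCallIfNew ?, ?", row['FileName'], row['Name'])
conn.commit()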

4 Answers

4

Assuming you have no memory constraints and you're not inserting null values, you could:

sql = "SELECT pk_1, pk_2, pk_3 FROM my_table"
sql_df = pd.read_sql(sql=sql, con=con)
df = pd.concat((df, sql_df)).drop_duplicates(subset=['pk_1', 'pk_2', 'pk_3'], keep=False)
df = df.dropna()
df.to_sql('my_table', con=con, if_exists='append')

Depending on the application, you could also reduce the size of sql_df by changing the query.
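
For example, a sketch under the assumption that collisions can only come from recent imports (import_date here is a hypothetical column in my_table):

# only pull keys that could plausibly collide with this batch
sql = "SELECT pk_1, pk_2, pk_3 FROM my_table WHERE import_date >= DATEADD(day, -30, GETDATE())"
sql_df = pd.read_sql(sql=sql, con=con)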

Update - a better overall approach that can also insert null values:

sql = "SELECT pk_1, pk_2, pk_3 FROM my_table"
sql_df = pd.read_sql(sql=sql, con=con)
df = df.loc[df[pks].merge(sql_df[pks], on=pks, how='left', indicator=True)['_merge'] == 'left_only']
# df = df.drop_duplicates(subset=pks) # add it if you want to drop any duplicates that you may insert
df.to_sql('my_table', con=con, if_exists='append')
Lucas
2

What if you iterated through the rows with DataFrame.iterrows() and then, on each iteration, used ON DUPLICATE for your key value FileName so that it isn't added again?
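
ON DUPLICATE KEY UPDATE is MySQL syntax, so for the MS SQL target in the question the per-row equivalent would be to guard each insert with IF NOT EXISTS. A rough sketch, assuming the Calls table from the question and an SQLAlchemy 1.4+ engine (exec_driver_sql lets the pyodbc ? placeholders pass through unchanged):

insert_if_new = """
IF NOT EXISTS (SELECT 1 FROM Calls WHERE FileName = ?)
    INSERT INTO Calls (FileName, Name) VALUES (?, ?);
"""

with engine.begin() as conn:
    for _, row in df.iterrows():
        conn.exec_driver_sql(insert_if_new, (row['FileName'], row['FileName'], row['Name']))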

Benloper
0

You can check whether it is empty, like this:

sql = "SELECT pk_1, pk_2, pk_3 FROM my_table"
sql_df = pd.read_sql(sql=sql, con=con)
    
if sql_df.empty:
    print("Is empty")
else:
     print("Is not empty")
0

You can set the parameter index=False; see the example below.

data.to_sql('book_details', con=engine, if_exists='append', chunksize=1000, index=False)

If it is not set, the command automatically adds the index column:

book_details is the name of the table we want to insert our dataframe into.

Result

[SQL: INSERT INTO book_details (`index`, book_id, title, price) VALUES (%(index)s, %(book_id)s, %(title)s, %(price)s)]
[parameters: ({'index': 0, 'book_id': 55, 'title': 'Programming', 'price': 29},
{'index': 1, 'book_id': 66, 'title': 'Learn', 'price': 23},
{'index': 2, 'book_id': 77, 'title': 'Data Science', 'price': 27})]

Therefore, the index column would need to exist in the table, unless you pass index=False.