I have a simple task that is working, but it is taking way too long to finish. The task consists of retrieving data from an API (a JSON file) and checking whether that data is already in the database or not.

If the data is in the database, update the item; otherwise, insert it.

I do this by looping through the JSON data, checking each item_id against a set of the item_ids currently in the database, and executing the appropriate statement for each case. For example:

for item in data_json:
    if item["item_id"] in setid_db:
        cursor.execute(update...)
    else:
        cursor.execute(insert...)

I already tried fast_executemany, but it was nearly the same speed.
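
Roughly, that attempt looked like the following minimal sketch. It assumes pyodbc, and the `items` table with `item_id` and `value` columns is just a placeholder for my real schema and statements:

import pyodbc

conn = pyodbc.connect(connection_string)  # connection string assumed to exist
cursor = conn.cursor()
cursor.fast_executemany = True  # let pyodbc send the parameter arrays in batches

# split the rows once instead of deciding per item inside the loop
updates = [(item["value"], item["item_id"])
           for item in data_json if item["item_id"] in setid_db]
inserts = [(item["item_id"], item["value"])
           for item in data_json if item["item_id"] not in setid_db]

if updates:
    cursor.executemany("UPDATE items SET value = ? WHERE item_id = ?", updates)
if inserts:
    cursor.executemany("INSERT INTO items (item_id, value) VALUES (?, ?)", inserts)
conn.commit()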

On another task I just needed to insert all the data at once from a dataframe, so I used df.to_sql with method="multi" and chunksize=50, which worked fine and was much faster. I just don't know yet how to do the equivalent for the current task. Any help would be appreciated.
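
That call looked roughly like this (a sketch assuming a SQLAlchemy engine; the connection URI and the `items` table name are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(connection_uri)  # connection URI assumed to exist
# "replace" rewrites the whole table, which was fine for that one-off full load
df.to_sql("items", engine, if_exists="replace", index=False,
          method="multi", chunksize=50)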

PS: After the first load of data from the API (say 30k items) into the database, the next load may contain only 200 items to update or insert. So using df.to_sql in these scenarios would overwrite all existing data in the database. That's why I need to execute those statements and make them faster.


1 Answer


Perhaps you can preprocess your data_json into two groups: one that needs to be inserted (does not already exist in the database) and one that needs to be updated (already exists in the database). Then you can use your df.to_sql call with method="multi" and chunksize=50. You may be able to convert your data_json directly into a Pandas DataFrame with pandas.DataFrame(data_json) (here I am assuming data_json is a Python object, probably a dictionary, instead of a string in JSON format).

If you can convert your data_json into a data frame first, you may be able to run:

import pandas as pd

df = pd.DataFrame(data_json)
# rows whose item_id already exists in the database -> the UPDATE group
df.loc[lambda D: D.item_id.isin(setid_db), :].to_sql(<UPDATE>)
# rows whose item_id is not yet in the database -> the INSERT group
df.loc[lambda D: ~D.item_id.isin(setid_db), :].to_sql(<INSERT>)
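
For the <INSERT> group, a concrete call might look like the sketch below (the engine and the "items" table name are placeholders). Passing if_exists="append" adds the new rows without touching existing ones; whether the <UPDATE> group can be handled by to_sql at all is less clear (see the comments below).

new_rows = df.loc[lambda D: ~D.item_id.isin(setid_db), :]
new_rows.to_sql("items", engine, if_exists="append", index=False,
                method="multi", chunksize=50)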
ogdenkev
  • The only problem, which I forgot to mention, is that after the first load of data from the API (say 30k items) into the database, the next load may contain only 200 items to update or insert. So using df.to_sql in these scenarios would overwrite all existing data in the database. That's why I'm using cursor.execute(). – collected Jul 16 '22 at 13:01
  • OK. Pandas' [`DataFrame.to_sql`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql) method takes an `if_exists` parameter, which has an `append` option that will add data to the table if the table already exists. That would work for new data that needs to be INSERTed. However, I'm not sure `to_sql()` has a way to update already existing data. – ogdenkev Jul 16 '22 at 15:58
  • Thanks for your help! It turned out that the error refers to some None values present in the file that I want to use to perform the update; I will work on it! – collected Jul 17 '22 at 01:07