
I have a Django model that records transactions. I need to update only two of the fields on some of those transactions.

To perform the update, the user is asked to provide additional data, and I use pandas to run calculations on that extra data.

I use the output from the pandas script to update the original model like this:

for i in df.tnsx_uuid:
    t = Transactions.objects.get(tnsx_uuid=i)
    t.start_bal = df.loc[df.tnsx_uuid==i].start_bal.values[0]
    t.end_bal = df.loc[df.tnsx_uuid==i].end_bal.values[0]
    t.save()

This is very slow. What is the best way to do this?

UPDATE: After some more research, I found bulk_update and changed the code to:

transactions = Transactions.objects.select_for_update()\
        .filter(tnsx_uuid__in=list(df.tnsx_uuid)).only('start_bal', 'end_bal')
for t in transactions:
    i = t.tnsx_uuid
    t.start_bal = df.loc[df.tnsx_uuid==i].start_bal.values[0]
    t.end_bal = df.loc[df.tnsx_uuid==i].end_bal.values[0]
Transactions.objects.bulk_update(transactions, ['start_bal', 'end_bal'])

This has approximately halved the time required.

How can I improve performance further?

John

2 Answers


I have been looking for the answer to this question and haven't found any authoritative, idiomatic solutions. So, here's what I've settled on for my own use:

transactions = Transactions.objects.filter(tnsx_uuid__in=list(df.tnsx_uuid))
# Build a DataFrame of Django model instances
trans_df = pd.DataFrame([{'tnsx_uuid': t.tnsx_uuid, 'object': t} for t in transactions])
# Join the Django instances to the main DataFrame on the index
df = df.join(trans_df.set_index('tnsx_uuid'))

for obj, start_bal, end_bal in zip(df['object'], df['start_bal'], df['end_bal']):
    obj.start_bal = start_bal
    obj.end_bal = end_bal

Transactions.objects.bulk_update(df['object'], ['start_bal', 'end_bal'])

I don't know how DataFrame.loc[] is implemented, but it could be slow if it has to scan the whole DataFrame on each use rather than do a hash lookup. For that reason, and to simplify things with a single-iteration loop, I pulled all of the model instances into df and then used the recommendation from a Stack Overflow answer on iterating over DataFrames to loop over the zipped columns of interest.
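If you want to keep the per-row lookup style instead, one option (just a sketch, assuming tnsx_uuid is unique within df) is to index the DataFrame on tnsx_uuid up front, so each lookup is a hash lookup rather than a scan of the whole frame:

# Index the DataFrame once so each per-row lookup is a hash lookup
# (assumes tnsx_uuid values are unique).
indexed = df.set_index('tnsx_uuid')

transactions = list(Transactions.objects.filter(tnsx_uuid__in=list(indexed.index)))
for t in transactions:
    row = indexed.loc[t.tnsx_uuid]
    t.start_bal = row['start_bal']
    t.end_bal = row['end_bal']

Transactions.objects.bulk_update(transactions, ['start_bal', 'end_bal'])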

I looked at the documentation for select_for_update in Django and it isn't apparent to me that it offers a performance improvement, but you may be using it to lock the selected rows and make all of the changes atomically. Per the documentation, bulk_update should be faster than saving each object individually.
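As an aside, select_for_update() only takes effect when the queryset is evaluated inside a transaction, so if row locking is the goal the update needs to happen in an atomic block. A minimal sketch, assuming the same df and field names as the question:

from django.db import transaction

# On backends that support SELECT ... FOR UPDATE, evaluating a
# select_for_update() queryset outside an atomic block raises
# TransactionManagementError.
with transaction.atomic():
    locked = list(Transactions.objects.select_for_update()
                  .filter(tnsx_uuid__in=list(df.tnsx_uuid)))
    # ... assign start_bal / end_bal on each instance here ...
    Transactions.objects.bulk_update(locked, ['start_bal', 'end_bal'])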

In my case, I'm only updating 3500 items. I did some timing of the various steps and came up with the following:

  • 3.05 s to query and build the DataFrame
  • 2.79 ms to join the instances to df
  • 5.79 ms to run the for loop and update the instances
  • 1.21 s to bulk_update the changes

So, I think you would need to profile your code to see what is actually taking time, but it is likely a Django issue rather than a Pandas issue.
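A rough way to see where the time goes (purely illustrative, using the same names as above) is to wrap each step in a small timer:

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Crude timer for profiling each step.
    start = time.perf_counter()
    yield
    print(f'{label}: {time.perf_counter() - start:.3f}s')

with timed('query and build DataFrame'):
    transactions = Transactions.objects.filter(tnsx_uuid__in=list(df.tnsx_uuid))
    trans_df = pd.DataFrame([{'tnsx_uuid': t.tnsx_uuid, 'object': t} for t in transactions])

with timed('join instances to df'):
    df = df.join(trans_df.set_index('tnsx_uuid'))

with timed('update instances'):
    for obj, start_bal, end_bal in zip(df['object'], df['start_bal'], df['end_bal']):
        obj.start_bal = start_bal
        obj.end_bal = end_bal

with timed('bulk_update'):
    Transactions.objects.bulk_update(df['object'], ['start_bal', 'end_bal'])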


I faced more or less the same issue (a similar number of records, ~3500), and I would like to add:

  • bulk_update seems to perform a lot worse than bulk_create. In my case deleting objects was allowed, so instead of bulk-updating I delete all the objects and then recreate them (sketched after step d below).
  • I used the same approach as you (thanks for the idea), but with some modifications:

a) I create the DataFrame from the queryset itself:

all_objects_values = all_objects.values('id', 'date', 'amount')
self.df_values = pd.DataFrame.from_records(all_objects_values)

b) Then I create the column of objects without iterating (making sure both are in the same order):

self.df_values['object'] = list(all_objects)
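To make that ordering explicit (a standalone sketch; assumes id is the primary key and the same field names as above), an order_by on the queryset keeps the .values() rows and the instance list aligned:

# Both evaluations below run the same ordered queryset, so the DataFrame
# rows and the instance list line up index-for-index.
all_objects = MyModel.objects.order_by('id')
df_values = pd.DataFrame.from_records(all_objects.values('id', 'date', 'amount'))
df_values['object'] = list(all_objects)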

c) To update the object values (after the operations done in my DataFrame), I iterate over the rows (not sure about the performance difference):

for index, row in self.df_values.iterrows():
    row['object'].amount = row['amount']

d) At the end, I re-create all objects:

MyModel.objects.bulk_create(self.df_values['object'].tolist())
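For completeness, a sketch of the full delete-and-recreate step. Wrapping it in an atomic block means a failure partway through doesn't leave the table half-emptied, and because the instances were loaded from the database they still carry their primary keys, so bulk_create re-inserts them with the same ids:

from django.db import transaction

# Delete the originals and re-insert the modified instances in one
# transaction, so a partial failure rolls everything back.
with transaction.atomic():
    MyModel.objects.filter(id__in=self.df_values['id'].tolist()).delete()
    MyModel.objects.bulk_create(self.df_values['object'].tolist())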

Conclusion:

  • In my case, the most time-consuming step was the bulk_update, so re-creating the objects solved it for me (from 19 seconds with bulk_update down to 10 seconds with delete + bulk_create).
  • In your case, using my approach may also improve the time of the other operations.