
I am currently modifying a pandas dataframe in a loop structure which looks something like this:

for item in item_list:
    
    # ... do something to the item ...

    results_df = results_df.append(item)

This code is fine when the appended items are small and results_df stays small. However, the items I am appending are reasonably large and the loop is quite long, so the loop takes a long time to complete: each append copies the whole of results_df, and that copy gets more expensive as the dataframe grows.

One solution I can see is to append items to lists held in a dictionary, like:

results_dict = {'result_1': [], 'result_2': [], 'result_3': []}
for item in item_list:
    item_1, item_2, item_3 = item

    # ... do something ...

    results_dict['result_1'].append(item_1)
    results_dict['result_2'].append(item_2)
    results_dict['result_3'].append(item_3)

The dataframe can then be built from the resulting dictionary. This is OK but does not seem optimal. Can anyone think of a better solution? NB: the items in item_list are reasonably large dataframes on which some complex processing takes place, and the length of item_list is on the order of 1000.
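Another pattern I can see (just a sketch; `process` below is a placeholder for the real per-item work, and the sample frames stand in for my actual data) is to collect the pieces in a plain list and call `pd.concat` once at the end, since the items are themselves dataframes:

```python
import pandas as pd

# `process` is a stand-in for the real per-item computation.
def process(item):
    return item

# Stand-in for item_list; the real items are large dataframes.
item_list = [pd.DataFrame({'a': [1, 2]}), pd.DataFrame({'a': [3, 4]})]

pieces = []
for item in item_list:
    pieces.append(process(item))   # O(1) list append, no dataframe copy

# One concatenation at the end instead of a copy per iteration.
results_df = pd.concat(pieces, ignore_index=True)
```

This keeps the per-iteration work cheap and pays the copying cost once, but I don't know whether it beats the dictionary-of-lists version in practice.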

tam63
    It would be helpful to see a [mcve] including sample input(s) and expected output. This question is very dependent on those items. See also: [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – G. Anderson Sep 17 '20 at 15:31
  • @G.Anderson the items are numbers and strings, the way they are calculated isn't important as I already know the bottleneck is in the append operation – tam63 Sep 17 '20 at 15:33
  • That isn't the purpose of a minimal example; it's so that we can actually try out potential solutions for you and have a better idea of your exact needs, so we don't need to guess. – Jasmijn Sep 17 '20 at 15:50
  • What more would you need to know in this case other than that I am appending numbers to a pandas dataframe? it seems that I have defined the problem completely already – tam63 Sep 17 '20 at 15:55
  • Where do the numbers come from, what shape are they, what are their exact types and values etc. – Jasmijn Sep 17 '20 at 16:00

1 Answer


Although you are doing essentially the same thing with the dictionary, by my understanding the dictionary version pays an extra O(1) dictionary lookup on every iteration before each list append.

You can make a list for each column (the items, in your case) and build a dataframe from those lists:

import pandas as pd

item_1_list = []
item_2_list = []
item_3_list = []

for item in item_list:
    item_1, item_2, item_3 = item
    
    item_1_list.append(item_1)
    item_2_list.append(item_2)
    item_3_list.append(item_3)

df = pd.DataFrame({'item_1': item_1_list, 'item_2': item_2_list,'item_3': item_3_list})
del item_1_list,item_2_list,item_3_list

Although an O(1) dictionary lookup doesn't matter much, I think you can still be slightly better off with plain lists.

Here are the benchmarks

For your approach:

import timeit
import pandas as pd

start = timeit.default_timer()

results_dict = {'result_1': [], 'result_2': [], 'result_3': []}
for item in range(1000):
    results_dict['result_1'].append(item)
    results_dict['result_2'].append(item)
    results_dict['result_3'].append(item)
df = pd.DataFrame(results_dict)

stop = timeit.default_timer()

print('Time: ', stop - start)

Time it took:

Time:  0.013144109999984721

With the list approach:

import timeit
import pandas as pd

start = timeit.default_timer()

item_1_list = []
item_2_list = []
item_3_list = []

for item in range(1000):
    item_1_list.append(item)
    item_2_list.append(item)
    item_3_list.append(item)

df = pd.DataFrame({'item_1': item_1_list, 'item_2': item_2_list, 'item_3': item_3_list})

stop = timeit.default_timer()

print('Time: ', stop - start)

Time it took:

Time:  0.005675986999960969
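As a side note (a sketch, separate from the benchmarks above; the data and column names here are placeholders): since each `item` already unpacks into three values, you can skip the separate lists entirely, append the tuples to a single list, and let `pd.DataFrame` split them into columns:

```python
import pandas as pd

# Stand-in data; each tuple plays the role of one processed `item`.
processed_items = [(1, 'x', 1.0), (2, 'y', 2.0)]

rows = []
for item in processed_items:
    rows.append(item)  # one list append per iteration, no lookups

# pandas splits the tuples into columns; the names are placeholders.
df = pd.DataFrame(rows, columns=['item_1', 'item_2', 'item_3'])
```

This reduces the loop body to a single append, though I haven't benchmarked it against the three-list version.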
A.B
  • This is the solution I have already with the dictionary... – tam63 Sep 17 '20 at 15:36
  • 1
    I have edited the answer and written why I believe it is still better – A.B Sep 17 '20 at 15:43
  • 1
    A constant factor doesn't change big oh notation. O(3) = O(1). – Jasmijn Sep 17 '20 at 15:51
  • @Jasmijn Okay, thanks for the guidance, but then why is there a difference in the benchmarks? Does the O(1) lookup still add up? If not, I guess my answer isn't correct? – A.B Sep 17 '20 at 15:52
  • 1
    If two algorithms have the same time complexity, that means they can differ in a constant factor, but as the input grows larger that difference doesn't grow. If the one without the dictionary takes 1 second, the one with dict takes 2.3 seconds, if the one takes 1 hour, the other takes 2.3 hours, etc. Suppose there is another algorithm that could add N items in O(log(N)) time instead of O(N) time, then that would have a far larger impact on the performance of operating on a large dataset. – Jasmijn Sep 17 '20 at 15:59
  • @Jasmijn Thank you for explaining, it makes sense. One last thing: can we ignore this constant factor in this case, as it won't have much effect? – A.B Sep 17 '20 at 16:02
  • It depends on the larger context in which this bit of code needs to run. Usually, if you're in a situation where that matters, you wouldn't be using Python in the first place, though. – Jasmijn Sep 17 '20 at 22:12