I have found that retrieving data from a DataFrame is very fast: I created a DataFrame with 1 million rows, and filtering for the data I needed took less than 1 second. So why is it so slow when I use the append method to add rows to an empty DataFrame?

Here is my code, which took more than 2 hours to run. What am I missing? Is there a better way to add data than the df.append method?

import datetime
import random

import pandas as pd

data = pd.DataFrame(columns=('Open', 'High', 'Low', 'Close', 'Avg20'))
start = datetime.datetime.now()
for i in range(1000000):
    # Report progress every 10,000 rows.
    if i % 10000 == 0:
        print(i / 1000000 * 100, '% completed.')
    # Append one row of random values per iteration (this is the slow step).
    # The last key is 'Avg20' so that it matches the declared columns.
    data = data.append({'Open': random.random(), 'High': random.random(),
                        'Low': random.random(), 'Close': random.random(),
                        'Avg20': random.random()}, ignore_index=True)

end = datetime.datetime.now()
print(start, end)

Thanks in advance.

Sun Jar
  • Does this answer your question? [Python - Efficient way to add rows to dataframe](https://stackoverflow.com/questions/41888080/python-efficient-way-to-add-rows-to-dataframe) – Mahrkeenerh Nov 02 '21 at 08:25
  • The slowness of `append` is something that you usually stumble upon when working with `DataFrames`. Can you clarify if you truly need to append row-wise or if you actually have the full dataset available and hence, can create the entire df in one go? – Kosmos Nov 02 '21 at 08:33
  • Hi Kosmos, thanks for the reply. Yes, I need to add the data row by row: I don't have the entire dataset at the start, since it all arrives from another data source, so I can't create the whole DataFrame at once. – Sun Jar Nov 02 '21 at 08:40
  • Hi Mahrkeenerh, the answer is very helpful. I tried the df.loc method; it is better than df.append but still feels slow, much slower than searching data with df.iloc. Maybe there is no better solution to this problem. Thank you. – Sun Jar Nov 02 '21 at 09:01

1 Answer

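DataFrame.append is slow because it does not modify the DataFrame in place: every call creates an entirely new DataFrame and copies all existing rows into it. Appending N rows one at a time therefore copies on the order of N² rows in total, which is why a loop of 1 million appends takes hours.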

If you just want to optimize the code above, append each row to a plain Python list rather than to the DataFrame (appending to a list is fast), then create the DataFrame once, outside the loop, by passing it the list of rows.
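As a minimal sketch of that approach, assuming each row can be represented as a dict keyed by column name (the `build_frame` helper and `COLUMNS` constant are names made up for this illustration, and the random values stand in for whatever the real data source provides):

```python
import random

import pandas as pd

# Column names taken from the question; COLUMNS itself is an illustrative name.
COLUMNS = ["Open", "High", "Low", "Close", "Avg20"]


def build_frame(n_rows):
    """Collect rows in a list, then build the DataFrame in one step."""
    rows = []
    for _ in range(n_rows):
        # Appending to a list does not copy the rows collected so far.
        rows.append({name: random.random() for name in COLUMNS})
    # A single DataFrame construction at the end, outside the loop.
    return pd.DataFrame(rows, columns=COLUMNS)


data = build_frame(100_000)
print(data.shape)
```

The same pattern works when rows arrive one at a time from an external source: keep appending to the list as they come in, and build (or rebuild) the DataFrame only when you actually need to query it.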

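Similarly, if you need to combine many DataFrames, it is fastest to do so with a single call to pd.concat rather than many calls to DataFrame.append. Note also that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat is the way to do this in current versions.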
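For the many-DataFrames case, a rough sketch might look like the following. Here `make_chunk` is a hypothetical stand-in for whatever produces each batch of data, and the chunk sizes are arbitrary:

```python
import random

import pandas as pd

# Column names taken from the question; COLUMNS itself is an illustrative name.
COLUMNS = ["Open", "High", "Low", "Close", "Avg20"]


def make_chunk(n_rows):
    """Hypothetical stand-in for one batch arriving from an external source."""
    return pd.DataFrame(
        {name: [random.random() for _ in range(n_rows)] for name in COLUMNS}
    )


# Keep the pieces in a list while they arrive...
chunks = [make_chunk(1_000) for _ in range(100)]

# ...and combine them with a single concat call at the end.
combined = pd.concat(chunks, ignore_index=True)
print(combined.shape)
```

Passing `ignore_index=True` gives the result a fresh 0..N-1 index instead of repeating each chunk's own index.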

John Greenall