
I'm paginating through an API and saving the responses to a dataframe.

I can collect 100 rows at a time, and this loop currently runs for over an hour.

I'm worried it's because, once the dataframe grows past 100,000 rows, appending the next 100 becomes very inefficient.

Here's my current code:

import requests
import pandas as pd
from pandas import json_normalize

while JSONContent['next'][0:10] > unixtime_yesterday:
    try:
        url = ...
        JSONContent = requests.request("GET", url).json()

        # keep only the three columns I need and drop duplicates within this page
        temp_df = json_normalize(JSONContent['data'])
        temp_df = temp_df[['email', 'datetime', 'Campaign Name']]
        temp_df.drop_duplicates(subset=['email', 'Campaign Name'],
                                keep='last', inplace=True, ignore_index=True)

        # append this page of 100 rows to the accumulated dataframe
        df_received = df_received.append(temp_df, ignore_index=True, sort=False)

    except ValueError:
        print('There was a JSONDecodeError')

To make this as efficient as possible, I keep only three columns from each response and drop any duplicates that appear within the 100 rows.

  • How many dataframes? You're up to at least 1,000. Do you want this loop to run indefinitely, or is there a maximum? – smci Apr 19 '22 at 11:45

1 Answer


When you append in series many times, it is usually more efficient to collect the pieces in a list of dataframes and concatenate them once at the end than to call DataFrame.append at each iteration: every append copies all of the rows accumulated so far, so the total work grows quadratically with the number of pages.

For example,

%%timeit
# assumes pandas is already imported as pd
dfs = []
for i in range(10000):
    tmp1 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
    dfs.append(tmp1)  # cheap list append, no dataframe copying
pd.concat(dfs)
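
For comparison, the append-at-each-iteration version being timed is presumably something like the sketch below (the answer doesn't show it). Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so it only runs on older pandas:

%%timeit
# append at every iteration: each call returns a new dataframe and
# copies all of the rows accumulated so far
df = pd.DataFrame()
for i in range(10000):
    tmp1 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
    df = df.append(tmp1)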

The list-and-concat version gives 1.44 s ± 88.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each), whereas the same loop using append at each iteration gives 2.81 s ± 126 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).
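
Applied to the question's loop, the same pattern might look roughly like the sketch below. It is untested and keeps the original field names, the elided url, and the ValueError handling as they are; only the accumulation strategy changes:

import requests
import pandas as pd

pages = []  # collect each 100-row page here instead of growing df_received

while JSONContent['next'][0:10] > unixtime_yesterday:
    try:
        url = ...
        JSONContent = requests.request("GET", url).json()

        temp_df = pd.json_normalize(JSONContent['data'])
        temp_df = temp_df[['email', 'datetime', 'Campaign Name']]
        temp_df.drop_duplicates(subset=['email', 'Campaign Name'],
                                keep='last', inplace=True, ignore_index=True)

        pages.append(temp_df)  # cheap list append, no dataframe copying

    except ValueError:
        print('There was a JSONDecodeError')

# one concatenation at the end instead of one copy per page
df_received = pd.concat(pages, ignore_index=True, sort=False)

The per-page drop_duplicates is unchanged; if duplicates can also span pages, a final drop_duplicates on df_received after the concat would cover that.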

robinood