
I'm paginating through an API and saving the responses to a dataframe.

I can collect 100 rows at a time, and this loop currently runs for over an hour.

I'm worried it's because, once the dataframe grows past 100,000 rows, appending the next 100 becomes very inefficient.

Here's my current code:

import requests
import pandas as pd
from pandas import json_normalize

while JSONContent['next'][0:10] > unixtime_yesterday:
    try:
        url = ...
        JSONContent = requests.request("GET", url).json()

        # keep only the three columns I need and drop duplicates within this page
        temp_df = json_normalize(JSONContent['data'])
        temp_df = temp_df[['email', 'datetime', 'Campaign Name']]
        temp_df.drop_duplicates(subset=['email', 'Campaign Name'],
                                keep='last', inplace=True, ignore_index=True)

        # append this page of 100 rows to the accumulated dataframe
        df_received = df_received.append(temp_df, ignore_index=True, sort=False)

    except ValueError:
        print('There was a JSONDecodeError')

To make this as efficient as possible, I keep only three columns from each response and drop any duplicates that appear within the 100 rows.

  • How many dataframes? You're up to at least 1,000. Do you want this loop to run indefinitely, or is there a maximum? – smci Apr 19 '22 at 11:45

1 Answer


When you append in series many times, it is usually more efficient to collect the pieces in a list of dataframes and concatenate them once at the end than to call DataFrame.append at each iteration: every append copies all of the rows accumulated so far, so the total work grows quadratically with the number of pages.

For example,

%%timeit
# assumes pandas is already imported as pd
dfs = []
for i in range(10000):
    tmp1 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
    dfs.append(tmp1)  # cheap list append, no dataframe copying
pd.concat(dfs)
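
For comparison, the append-at-each-iteration version being timed is presumably something like the sketch below (the answer doesn't show it). Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so it only runs on older pandas:

%%timeit
# append at every iteration: each call returns a new dataframe and
# copies all of the rows accumulated so far
df = pd.DataFrame()
for i in range(10000):
    tmp1 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
    df = df.append(tmp1)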

The list-and-concat version gives 1.44 s ± 88.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each), whereas the same loop using append at each iteration gives 2.81 s ± 126 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).
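
Applied to the question's loop, the same pattern might look roughly like the sketch below. It is untested and keeps the original field names, the elided url, and the ValueError handling as they are; only the accumulation strategy changes:

import requests
import pandas as pd

pages = []  # collect each 100-row page here instead of growing df_received

while JSONContent['next'][0:10] > unixtime_yesterday:
    try:
        url = ...
        JSONContent = requests.request("GET", url).json()

        temp_df = pd.json_normalize(JSONContent['data'])
        temp_df = temp_df[['email', 'datetime', 'Campaign Name']]
        temp_df.drop_duplicates(subset=['email', 'Campaign Name'],
                                keep='last', inplace=True, ignore_index=True)

        pages.append(temp_df)  # cheap list append, no dataframe copying

    except ValueError:
        print('There was a JSONDecodeError')

# one concatenation at the end instead of one copy per page
df_received = pd.concat(pages, ignore_index=True, sort=False)

The per-page drop_duplicates is unchanged; if duplicates can also span pages, a final drop_duplicates on df_received after the concat would cover that.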

robinood