I've got a DataFrame with some 7 million rows, and I'm trying to figure out the best way to add a few more rows to it.

The concatenation is taking around 8-9 seconds, which feels far too long just to add four rows to an existing DataFrame.

import datetime
import pandas as pd

rootPath = '/fullPathHere/'

# Load the existing ~7M-row dataset from parquet
start_time = datetime.datetime.now()
df = pd.read_parquet(rootPath + 'HistoricData.parquet', engine='fastparquet')
print(datetime.datetime.now() - start_time, len(df.index), 'DF read')
# display(df)

# Load the handful of new rows from CSV
start_time = datetime.datetime.now()
df_csv = pd.read_csv(rootPath + 'Full.csv')
print(datetime.datetime.now() - start_time, len(df_csv.index), 'CSV read')
# display(df_csv)

start_time = datetime.datetime.now()
df = df.reset_index(drop=True)
print(datetime.datetime.now() - start_time, 'Reset done')

# Append the new rows to the existing frame
start_time = datetime.datetime.now()
df = pd.concat([df, df_csv], ignore_index=True, axis=0)
print(datetime.datetime.now() - start_time, 'concat done')
print('DF is now ', len(df.index))

Output:

0:00:00.474582 7081379 DF read
0:00:00.001938 4 CSV read
0:00:00.036305 Reset done
0:00:09.777967 concat done   <<< Problem here
DF is now  7081383
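For what it's worth, pd.concat never appends in place: it allocates a brand-new frame and copies both inputs into it, so the cost scales with all ~7 million existing rows rather than the 4 new ones. The 8-9 second figure would also suggest the frame holds object/string columns, which copy far more slowly than numeric blocks (an assumption, since the schema isn't shown). A minimal sketch with synthetic data (shape and dtypes are placeholders) that makes the full copy visible:

import numpy as np
import pandas as pd

# Synthetic stand-ins for the real data (sizes and dtypes are assumptions)
big = pd.DataFrame(np.random.rand(7_000_000, 5), columns=list('abcde'))
small = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))

out = pd.concat([big, small], ignore_index=True)

# concat returns a new object backed by freshly copied blocks;
# the original frame is untouched
print(out is big)          # False
print(len(big), len(out))  # 7000000 7000004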

I also tried adding the 4 rows one at a time with a basic loc[] instead of pd.concat, and it looks like the first row takes ages to insert.

start_len = len(df.index)
for index, row in df_csv.iterrows():
    start_time = datetime.datetime.now()
    df.loc[start_len] = row   # setting with enlargement: adds one row
    print(datetime.datetime.now() - start_time, 'Row number ', start_len, ' added')
    start_len += 1
print('DF is now ', len(df.index))

Output:

0:00:00.481056 7081379 DF read
0:00:00.001424 4 CSV read
0:00:00.030245 Reset done
0:00:09.104362 Row number  7081379  added   <<< Problem here too
0:00:00.181974 Row number  7081380  added
0:00:00.124729 Row number  7081381  added
0:00:00.109489 Row number  7081382  added
DF is now  7081383
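Timing the same .loc enlargement on synthetic frames of different sizes (sizes are placeholders, not my real data) suggests the cost grows with the size of the frame, i.e. the first insert pays for copying all the existing rows and the later inserts hit an already-consolidated frame:

import time
import numpy as np
import pandas as pd

for n in (1_000_000, 7_000_000):
    frame = pd.DataFrame(np.random.rand(n, 5), columns=list('abcde'))
    start = time.perf_counter()
    frame.loc[n] = np.random.rand(5)   # enlarge by one row
    print(n, 'rows ->', round(time.perf_counter() - start, 3), 's')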

What am I doing wrong here?

I'm attempting to add a few rows to an existing DataFrame with reasonable performance, ideally within a second or so.

  • Try adding the copy=False parameter to the concat function. – Kezif Garbian Jul 14 '23 at 20:59
  • @KezifGarbian copy=False actually added more time to the processing: 0:00:00.344648 7081379 DF read, 0:00:00.001893 4 CSV read, 0:00:00.016629 Reset done, 0:00:10.761311 concat done, DF is now 7081383 - so 10-12 seconds with that parameter set to False. – ravishankarurp Jul 15 '23 at 10:36
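On the copy=False result above: as far as I can tell (pandas pre-2.0 semantics, so treat this as an assumption), copy=False only skips copies that would be redundant anyway; stacking two frames along axis=0 still has to allocate new consolidated blocks for the result, so the full copy happens either way:

# copy=False cannot avoid the allocation that row-wise concat needs,
# which would explain why the timing stayed around 10 seconds
df = pd.concat([df, df_csv], ignore_index=True, axis=0, copy=False)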

1 Answer


I don't have access to your data, but you can try append instead of concat:

df = df.append(df_csv, ignore_index=True)

That way you don't reallocate your DataFrame every time you run your script.
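Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, and it was implemented on top of concat, so the timing should be in the same ballpark. On a current pandas install the equivalent call is the concat the question already uses:

# DataFrame.append is gone in pandas >= 2.0; the equivalent call is:
df = pd.concat([df, df_csv], ignore_index=True)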

– P_n