I already looked at:
- Why does concatenation of DataFrames get exponentially slower?
- Why is Pandas Concatenation (pandas.concat) so Memory Inefficient?
My problem is slightly different from the links above.
I have approximately 400 txt files of 400 MB each (about 2.5 million rows and 17 columns per file):
| (session_date, instrument) | instrument_symbol | trade_number | trade_price | trade_quantity | trade_time | trade_indicator | buy_order_date | sequential_buy_order_number | secondary_order_id | aggressor_buy_order_indicator | sell_order_date | sequential_sell_order_number | secondary_order_id_sell_order | aggressor_sell_order_indicator | cross_trade_indicator | buy_member | sell_member |
|:------------------------------------------|:--------------------|---------------:|--------------:|-----------------:|:----------------|------------------:|:--------------------|------------------------------:|---------------------:|--------------------------------:|:--------------------|-------------------------------:|--------------------------------:|---------------------------------:|------------------------:|-------------:|--------------:|
| (Timestamp('2019-07-02 00:00:00'), 'DI1') | DI1F28 | 30 | 7.47 | 5 | 10:59:13.494000 | 1 | 2019-07-02 00:00:00 | 7.62208e+10 | 1.82363e+06 | 2 | 2019-07-02 00:00:00 | 7.62208e+10 | 1.82363e+06 | 1 | 0 | 3 | 120 |
| (Timestamp('2019-07-02 00:00:00'), 'DOL') | DOLV19 | 10 | 3896 | 5 | 10:58:31.935000 | 1 | 2019-07-02 00:00:00 | 7.42335e+11 | 2.85725e+07 | 2 | 2019-07-02 00:00:00 | 7.42334e+11 | 2.85725e+07 | 2 | 0 | 107 | 93 |
| (Timestamp('2019-07-02 00:00:00'), 'DOL') | DOLV19 | 20 | 3905 | 5 | 11:12:20.306000 | 1 | 2019-07-02 00:00:00 | 7.42335e+11 | 2.94072e+07 | 1 | 2019-07-02 00:00:00 | 7.42335e+11 | 2.94072e+07 | 2 | 0 | 107 | 93 |
| (Timestamp('2019-07-02 00:00:00'), 'WDO') | WDOF20 | 20 | 3890 | 1 | 09:32:45.140000 | 1 | 2019-07-02 00:00:00 | 7.42333e+11 | 2.40179e+07 | 2 | 2019-07-02 00:00:00 | 7.42333e+11 | 2.40179e+07 | 2 | 0 | 107 | 107 |
| (Timestamp('2019-07-02 00:00:00'), 'WDO') | WDOF20 | 10 | 3890 | 18 | 09:32:45.140000 | 1 | 2019-07-02 00:00:00 | 7.42333e+11 | 2.40179e+07 | 2 | 2019-07-02 00:00:00 | 7.42333e+11 | 2.40179e+07 | 2 | 0 | 15 | 15 |
The table's variable is df_neg, and the code below loops through each instrument available (one of the index levels of df_neg). For each instrument it looks for an existing parquet file of the selected instrument, loads it, and concatenates it with the new information. After that, it exports the result as a parquet file.
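For reference, a tiny stand-in for df_neg with the same (session_date, instrument) MultiIndex can be built like this (values are copied from the sample above and only a few of the 17 columns are included, so it's purely illustrative):

```python
import pandas as pd

# Illustrative stand-in for df_neg: same (session_date, instrument) MultiIndex,
# only a handful of the 17 columns are reproduced here.
sample = pd.DataFrame({
    'session_date': pd.to_datetime(['2019-07-02', '2019-07-02', '2019-07-02']),
    'instrument': ['DI1', 'DOL', 'WDO'],
    'instrument_symbol': ['DI1F28', 'DOLV19', 'WDOF20'],
    'trade_price': [7.47, 3896.0, 3890.0],
    'trade_quantity': [5, 5, 1],
})
df_neg = sample.set_index(['session_date', 'instrument'])
```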
The part of the script where I found the performance issue is this (I'm skipping the first part because it performs well and isn't relevant):
import time
import pandas as pd

for inst in instruments['instrument'].drop_duplicates():
    # Slice of df_neg for the current instrument, keeping the 'instrument' level
    df_neg_pq = df_neg.xs(inst, level='instrument', drop_level=False)
    # Destination folder of this instrument's parquet file
    parquet_dest = 'C:\\Users\\tbmepyg\\Desktop\\parquet_output_apply\\%s\\%s\\' % (folder_name, inst)
    # Loading old parquet
    df_old_pq = pd.read_parquet(parquet_dest + inst + '.parquet', engine='pyarrow')
    # Concatenating - slow part
    start_concat = time.time()
    append_data = list()
    append_data.append(df_old_pq)
    append_data.append(df_neg_pq)
    df_neg_pq_append = pd.concat(append_data).drop_duplicates(keep='last')
    end_concat = time.time()
    time_concat = end_concat - start_concat
    print('time concat of %s: %s' % (inst, str(time_concat)))
    # Exporting to parquet - slow part
    df_neg_pq_append.to_parquet(parquet_dest + inst + '.parquet', engine='pyarrow', compression='gzip')
The concatenation part gets exponentially slower as the loop runs. I tried to replicate the code from the first link above, but it didn't improve performance, probably because my concatenation uses just two data frames in each loop, and new data is used before each iteration. That's the difference between my question and the ones referenced above.
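To narrow down which step actually dominates, the loop body can be instrumented per step; this is a sketch of the measurement I have in mind (timed_merge is just an illustrative helper name, and df_old_pq / df_neg_pq are the two frames from the loop above):

```python
import time
import pandas as pd

def timed_merge(df_old_pq, df_neg_pq, out_path):
    """Time each step separately: plain concat, drop_duplicates, gzip parquet write."""
    t0 = time.time()
    combined = pd.concat([df_old_pq, df_neg_pq])
    t1 = time.time()
    deduped = combined.drop_duplicates(keep='last')
    t2 = time.time()
    deduped.to_parquet(out_path, engine='pyarrow', compression='gzip')
    t3 = time.time()
    print('concat: %.2fs, drop_duplicates: %.2fs, to_parquet: %.2fs'
          % (t1 - t0, t2 - t1, t3 - t2))
    return deduped
```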
How can I efficiently concatenate two data frames and export the output as parquet without it getting exponentially slower?
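For context, one direction I'm wondering about (but haven't benchmarked) is writing each new slice as its own file inside the per-instrument folder instead of rewriting one ever-growing file, and only deduplicating when the data is read back. A rough sketch of that idea, with the file naming and read-time dedup being my own assumptions:

```python
import os
import pandas as pd

def append_batch(df_neg_pq, inst_dir, batch_id):
    """Write only the new slice as its own parquet file; no read-concat-rewrite."""
    os.makedirs(inst_dir, exist_ok=True)
    df_neg_pq.to_parquet(os.path.join(inst_dir, '%s.parquet' % batch_id),
                         engine='pyarrow', compression='gzip')

def load_instrument(inst_dir):
    """Read all batch files for one instrument as a single dataset and deduplicate once."""
    return pd.read_parquet(inst_dir, engine='pyarrow').drop_duplicates(keep='last')
```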