I already looked at:

My problem is slightly different from the ones in the questions linked above.

I have approximately 400 .txt files of about 400 MB each (around 2.5 million rows and 17 columns per file):

|  (session_date, instrument)               | instrument_symbol   |   trade_number |   trade_price |   trade_quantity | trade_time      |   trade_indicator | buy_order_date      |   sequential_buy_order_number |   secondary_order_id |   aggressor_buy_order_indicator | sell_order_date     |   sequential_sell_order_number |   secondary_order_id_sell_order |   aggressor_sell_order_indicator |   cross_trade_indicator |   buy_member |   sell_member |
|:------------------------------------------|:--------------------|---------------:|--------------:|-----------------:|:----------------|------------------:|:--------------------|------------------------------:|---------------------:|--------------------------------:|:--------------------|-------------------------------:|--------------------------------:|---------------------------------:|------------------------:|-------------:|--------------:|
| (Timestamp('2019-07-02 00:00:00'), 'DI1') | DI1F28              |             30 |          7.47 |                5 | 10:59:13.494000 |                 1 | 2019-07-02 00:00:00 |                   7.62208e+10 |          1.82363e+06 |                               2 | 2019-07-02 00:00:00 |                    7.62208e+10 |                     1.82363e+06 |                                1 |                       0 |            3 |           120 |
| (Timestamp('2019-07-02 00:00:00'), 'DOL') | DOLV19              |             10 |       3896    |                5 | 10:58:31.935000 |                 1 | 2019-07-02 00:00:00 |                   7.42335e+11 |          2.85725e+07 |                               2 | 2019-07-02 00:00:00 |                    7.42334e+11 |                     2.85725e+07 |                                2 |                       0 |          107 |            93 |
| (Timestamp('2019-07-02 00:00:00'), 'DOL') | DOLV19              |             20 |       3905    |                5 | 11:12:20.306000 |                 1 | 2019-07-02 00:00:00 |                   7.42335e+11 |          2.94072e+07 |                               1 | 2019-07-02 00:00:00 |                    7.42335e+11 |                     2.94072e+07 |                                2 |                       0 |          107 |            93 |
| (Timestamp('2019-07-02 00:00:00'), 'WDO') | WDOF20              |             20 |       3890    |                1 | 09:32:45.140000 |                 1 | 2019-07-02 00:00:00 |                   7.42333e+11 |          2.40179e+07 |                               2 | 2019-07-02 00:00:00 |                    7.42333e+11 |                     2.40179e+07 |                                2 |                       0 |          107 |           107 |
| (Timestamp('2019-07-02 00:00:00'), 'WDO') | WDOF20              |             10 |       3890    |               18 | 09:32:45.140000 |                 1 | 2019-07-02 00:00:00 |                   7.42333e+11 |          2.40179e+07 |                               2 | 2019-07-02 00:00:00 |                    7.42333e+11 |                     2.40179e+07 |                                2 |                       0 |           15 |            15 |

This table is stored in the variable df_neg. The code below loops through each available instrument (one of the index levels of df_neg), looks for an existing Parquet file for that instrument, loads it, and concatenates it with the new data. It then exports the result as a Parquet file.

The part of the script where I ran into performance issues is the following (I'll skip the first part because it performs well and isn't relevant):

import time
import pandas as pd

# instruments, df_neg and folder_name are built in the first (fast) part of the script
for inst in instruments['instrument'].drop_duplicates():
    # Slice this instrument's rows out of the MultiIndex
    df_neg_pq = df_neg.xs(inst, level='instrument', drop_level=False)

    # Destination folder of the instrument's Parquet file
    parquet_dest = 'C:\\Users\\tbmepyg\\Desktop\\parquet_output_apply\\%s\\%s\\' % (folder_name, inst)

    # Loading the existing Parquet file
    df_old_pq = pd.read_parquet(parquet_dest + inst + '.parquet', engine='pyarrow')

    # Concatenating old and new data - slow part
    start_concat = time.time()
    append_data = list()
    append_data.append(df_old_pq)
    append_data.append(df_neg_pq)
    df_neg_pq_append = pd.concat(append_data).drop_duplicates(keep='last')
    end_concat = time.time()
    time_concat = end_concat - start_concat
    print('time concat of %s: %s' % (inst, str(time_concat)))

    # Exporting back to Parquet - slow part
    df_neg_pq_append.to_parquet(parquet_dest + inst + '.parquet', engine='pyarrow', compression='gzip')

The concatenation part keeps getting exponentially slower as more files are processed. I tried to replicate the code from the first link in this question, but it didn't improve the performance, probably because my concatenation only involves two data frames per loop iteration, and each iteration works on new data. That's the difference between my question and the ones referenced above.

How can I efficiently concatenate two data frames and export the output to Parquet on each iteration without it getting exponentially slower?
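For illustration, here is a rough, untested sketch of the kind of append-only layout I am considering as an alternative (base_path, session_label and the two helper functions are placeholders, not part of my current script): each load writes a new file into the instrument's folder instead of rewriting the whole history, and deduplication happens only when the data is read back, since pyarrow can read a folder of Parquet files as a single dataset.

import os
import pandas as pd

# Placeholder root folder (same location as in my current script)
base_path = 'C:\\Users\\tbmepyg\\Desktop\\parquet_output_apply'

def append_only_write(df_neg, instruments, folder_name, session_label):
    # Write only the new rows; existing files are never re-read or re-written
    for inst in instruments['instrument'].drop_duplicates():
        df_neg_pq = df_neg.xs(inst, level='instrument', drop_level=False)
        inst_dir = os.path.join(base_path, folder_name, inst)
        os.makedirs(inst_dir, exist_ok=True)
        df_neg_pq.to_parquet(os.path.join(inst_dir, '%s.parquet' % session_label),
                             engine='pyarrow', compression='gzip')

def read_instrument(folder_name, inst):
    # pyarrow treats the folder as one dataset; deduplicate once, at read time
    inst_dir = os.path.join(base_path, folder_name, inst)
    return pd.read_parquet(inst_dir, engine='pyarrow').drop_duplicates(keep='last')

If this behaves the way I expect, each write should stay roughly constant in cost because the old files are never touched, but I don't know whether this is the right direction or whether the original read-concat-write loop can be made fast enough.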
