
It's hard to show the dataframe here, so I uploaded an example CSV to Google Drive.

Here's the link: https://docs.google.com/spreadsheets/d/19m5hAo3wk9VbBNOyPQ96HhsIrjX20aT8NFwDZkkJRAE/edit?usp=sharing

I want to create n bid and ask columns in the dataframe using the data from the dictionary that's stored in the depth column of the CSV file.
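
For context, each cell of the depth column holds a dictionary that is stored as a string in the CSV. A minimal sketch of its shape, with illustrative values (only the bids and asks keys matter for what I'm doing):

# Illustrative shape of one depth entry (values are made up / truncated).
# In the CSV it is stored as a string, so it has to be parsed back into a dict.
depth = {
    'lastUpdateId': 1787277998027,
    'E': ...,                                # event timestamp, omitted here
    'bids': [[7.2450, 652], [7.2440, 545]],  # [price, volume] pairs
    'asks': [[7.2460, 861], [7.2470, 840]],  # the real lists contain many more levels
}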

I tried doing that in a for loop...

Here's my code:

total_depth_in_columns = 100

def restructure_data(df):

    # Create the new columns
    for i in range(total_depth_in_columns):
        df['bid_price_'+str(i+1)] = 0
        df['bid_volume_'+str(i+1)] = 0
        df['ask_price_'+str(i+1)] = 0
        df['ask_volume_'+str(i+1)] = 0

    for i in range(len(df)):
        depth = df['depth'].iloc[i]

        # Add the data to those columns
        for j in range(total_depth_in_columns):
            df['bid_price_'+str(j+1)].iloc[i] = depth['bids'][j][0]
            df['bid_volume_'+str(j+1)].iloc[i] = depth['bids'][j][1]
            df['ask_price_'+str(j+1)].iloc[i] = depth['asks'][j][0]
            df['ask_volume_'+str(j+1)].iloc[i] = depth['asks'][j][1]

    return df

It works fine when I have around 100 rows, but when I run this code on 10,000+ rows it takes hours to process. Is there a faster way to do this kind of operation using pandas?

Bucky
  • Code produces errors: 1) `//` is not a comment in Python (use `#`), and 2) `df['bid_price_'+str(j+1)].iloc[i] = depth['bids'][j][0]` gives **TypeError: string indices must be integers** – DarrylG Aug 07 '22 at 09:43
  • @DarrylG I didn't have any comments; I just added them for the Stack Overflow explanation. Will change it. – Bucky Aug 07 '22 at 09:45
  • I had already guessed the solution to #1, but what about #2 (which still occurs)? – DarrylG Aug 07 '22 at 09:51
  • Sorry, don't have time now but I would create a separate dataframe for each row. You can create it from the bids/asks list[list] in this manner: https://stackoverflow.com/questions/19112398/getting-list-of-lists-into-pandas-dataframe. Once you have the dataframes use pd.concat to join them together. – John M. Aug 07 '22 at 09:58
  • No problem. May I suggest you can easily convert the strings to dictionaries within the dataframe using: `df['depth'] = df["depth"].astype('str').apply(lambda x: ast.literal_eval(x))`. – DarrylG Aug 07 '22 at 10:09

1 Answer


Looping through a dataframe row by row is slow.

Solution 1: Working in memory

Code

import ast
import pandas as pd

def restructure_data(df):
    # The depth column holds a dictionary stored as a string.
    #   - Convert it to a dictionary using ast.literal_eval
    #   - Concatenate the lists of bids and asks
    # (the assignment expression `p := ...` requires Python 3.8+)
    df['bids-asks'] = df["depth"].astype('str').apply(lambda x: p['bids'] + p['asks'] if (p:=ast.literal_eval(x)) else x)

    # Names of the new columns
    columns = [f"{name}{j}" for j in range(1, (len(df['bids-asks'][0])//4)+1) for name in ["bid_price", "bid_volume", "ask_price", "ask_volume"]]

    # Split the bids-asks column into separate columns
    split_df = pd.DataFrame(df['bids-asks'].tolist(), columns=columns)

    # Concatenate df and split_df
    df = pd.concat([df, split_df], axis=1)

    # Drop the intermediate bids-asks column
    df.drop('bids-asks', axis=1, inplace=True)

    return df

Usage

# Read CSV file into Dataframe
df = pd.read_csv('brr.csv', index_col=[0])
df = restructure_data(df)
# Display header
print(df.head())

Output

Close   depth   bid_price1  bid_volume1 ask_price1  ask_volume1 bid_price2  bid_volume2 ask_price2  ask_volume2 ... ask_price498    ask_volume498   bid_price499    bid_volume499   ask_price499    ask_volume499   bid_price500    bid_volume500   ask_price500    ask_volume500
0   7.245   {'lastUpdateId': 1787277998027, 'E': 165985484...   [7.2450, 652]   [7.2440, 545]   [7.2430, 2643]  [7.2420, 1308]  [7.2410, 2201]  [7.2400, 4207]  [7.2390, 4878]  [7.2380, 5604]  ... [8.2930, 12]    [8.2940, 619]   [8.2950, 16]    [8.2960, 33]    [8.2970, 10]    [8.2980, 13]    [8.2990, 2] [8.3000, 1884]  [8.3010, 3] [8.3020, 1]
1   7.246   {'lastUpdateId': 1787278048764, 'E': 165985485...   [7.2450, 842]   [7.2440, 1121]  [7.2430, 1825]  [7.2420, 2728]  [7.2410, 3047]  [7.2400, 5431]  [7.2390, 4790]  [7.2380, 5526]  ... [8.2930, 12]    [8.2940, 619]   [8.2950, 16]    [8.2960, 33]    [8.2970, 10]    [8.2980, 13]    [8.2990, 2] [8.3000, 1884]  [8.3010, 3] [8.3020, 1]
2   7.247   {'lastUpdateId': 1787278510622, 'E': 165985487...   [7.2460, 861]   [7.2450, 1298]  [7.2440, 1636]  [7.2430, 3669]  [7.2420, 4127]  [7.2410, 3330]  [7.2400, 3865]  [7.2390, 4951]  ... [8.2940, 619]   [8.2950, 16]    [8.2960, 33]    [8.2970, 10]    [8.2980, 13]    [8.2990, 2] [8.3000, 1884]  [8.3010, 3] [8.3020, 1] [8.3030, 9]
3   7.246   {'lastUpdateId': 1787278719415, 'E': 165985488...   [7.2460, 809]   [7.2450, 1872]  [7.2440, 1230]  [7.2430, 3774]  [7.2420, 3498]  [7.2410, 3744]  [7.2400, 3906]  [7.2390, 5100]  ... [8.2940, 619]   [8.2950, 16]    [8.2960, 33]    [8.2970, 10]    [8.2980, 13]    [8.2990, 2] [8.3000, 1884]  [8.3010, 3] [8.3020, 1] [8.3030, 9]
4   7.247   {'lastUpdateId': 1787278800142, 'E': 165985488...   [7.2470, 840]   [7.2460, 2145]  [7.2450, 1797]  [7.2440, 2720]  [7.2430, 6232]  [7.2420, 3448]  [7.2410, 2882]  [7.2400, 3198]  ... [8.2950, 16]    [8.2960, 33]    [8.2970, 10]    [8.2980, 13]    [8.2990, 2] [8.3000, 1884]  [8.3010, 3] [8.3020, 1] [8.3030, 9] [8.3040, 7]
5 rows × 2002 columns

Solution 2: Working on Chunks of Data

The issue with Solution 1 is that it quickly exhausts RAM for large data files (e.g. 30K records). The alternative is to process the data in chunks:

  • Use the pandas read_csv chunksize parameter so we can process the input in chunks
  • Use a tqdm task status bar to monitor progress
  • Append each processed chunk to an output CSV file (a minimal sketch of this pattern is shown below, before the full code)
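
The core of the pattern, stripped of the details, is roughly the sketch below (the file names, chunk size, and process_chunk are illustrative placeholders; the full code that follows adds the depth converter, the column naming, and the tqdm progress bar):

import pandas as pd

def process_chunk(chunk):
    # Placeholder per-chunk transform; in the full code this is restructure_data()
    return chunk

first = True
for chunk in pd.read_csv('input.csv', chunksize=10):       # read 10 rows at a time
    processed = process_chunk(chunk)
    processed.to_csv('output.csv', mode='a', index=False,
                     header=first)                          # write the header only once
    first = False

Only one chunk is held in memory at a time, which is what keeps RAM usage bounded regardless of the file size.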

Code

import os
import pandas as pd
import ast

import csv
from pathlib import Path

import time
from tqdm import tqdm

# Main function
def restructure_data(df, columns):
    '''
        Split list in depth column
    
    '''
    # Split depth column into columns
    split_df = pd.DataFrame(df['depth'].tolist(), 
                            columns=columns,            # using specified column names
                            index = df.index)           # index from df

    # concat df and split_df
    df = pd.concat([df, split_df], axis=1)

    return df

# Function utilities
def append_result(filename, modifier = "-result"):
    '''
        Create an output file path with the modifier appended to the base name
    '''
    p = Path(filename)
    return f"{Path.joinpath(p.parent, p.stem)}{modifier}{p.suffix}"
    
def repeat_csv(filenm, n_repeats = 3):
    '''
        Duplicates the CSV file data
    '''
    new_file = append_result(filenm, "-repeats")
    with open(new_file, 'w', newline='') as fout, open(filenm, 'r') as fin:
            reader = csv.reader(fin)
            writer = csv.writer(fout)
            writer.writerow(next(reader))   # write header
            
            index = 0                       # keep track of index
            for l in reader:
                for _ in range(n_repeats):          # Repeat rows
                    l[0] = index           # update index value
                    index += 1
                    writer.writerow(l)
    
    return new_file

def get_bids_asks(x):
    '''
        Pull the bids and asks lists from the depth column's dictionary data
    '''
    p = ast.literal_eval(x)
    return p['bids'] + p['asks']

#################################################
# 1. Generate data by repeating the input file n_repeats times (also updating the index in column 0)
repeated_file = repeat_csv('temp.csv', n_repeats = 2)
    
#################################################
# 2. Main processing

# Results file name
results_file = append_result(repeated_file, "-result")

# Delete the results file if it already exists
if os.path.exists(results_file):
    os.remove(results_file)
    
# Process input file in chunks
for i, df in enumerate(tqdm(pd.read_csv(repeated_file, index_col=[0], 
                      converters = {'depth':get_bids_asks},
                      chunksize = 10))):    # Processing 10 rows of input at a time

    if i == 0:
        # Names of new columns
        columns = [f"{name}{j}" for j in range(1, (len(df['depth'][0])//4)+1) for name in ["bid_price", "bid_volume", "ask_price", "ask_volume"]]

    df_result = restructure_data(df, columns)

    # Append to results file
    df_result.to_csv(results_file, mode='a', 
              index = False, 
              header = (i == 0))  # Only write header for first output iteration

Performance

Tested using n_repeats = 1000 on the file linked in the question.

  • The repeated data contained 10,000 records.
  • The file was processed in ~11 minutes (wall-clock time) on an older Windows PC (10+ years old).
DarrylG
  • That's pretty creative. Thank you so much for the answer. But sadly that didn't solve the issue. I guess the problem is the depth column in that CSV file, which makes the dataframe pretty heavy and slows down the whole performance. – Bucky Aug 08 '22 at 06:31
  • @Bucky -- Was there any performance improvement? Are you saying it's still too slow on a large dataset? – DarrylG Aug 08 '22 at 06:39
  • @DarrylG yeah, performance is better, but when I use it on 30,000 rows it just never stops running. I'm experimenting with the dataframe, so I can't afford to wait hours for it to finish just to make another change and run the whole program again. On the other hand, adding those columns to the CSV while mining the tick data takes some time, which adds lag to the tick collection, so I decided to save the whole dictionary instead. But it still is unusable. Btw, your code is definitely faster, and that's what I asked in the question, so I guess I can accept this as an answer. – Bucky Aug 08 '22 at 06:55
  • @Bucky -- I'll take another look tomorrow (it's 3:00 AM in my time zone). – DarrylG Aug 08 '22 at 06:58
  • @Bucky -- discovered it's problematic to perform the processing in memory due to the large amount of data. Added a 2nd solution to my answer which processes the data in chunks (i.e. uses limited RAM). The desired output is placed in a CSV file rather than trying to contain it in a dataframe. Also, added a task status bar to give you progress feedback. – DarrylG Aug 09 '22 at 14:55