I have a question about how to do this in Python. I have a very large dataset (200 GB) and I plan to use Python to iterate through the lines, store the data in a dictionary, perform some calculations, and finally write the computed results to a CSV file. My concern is my computer's capacity: I am afraid (or pretty sure) that my RAM cannot hold a dataset that large. Is there a better way? (A streaming sketch addressing the memory concern follows the sample below.) Here is the structure of the input data:
#RIC Date[L] Time[L] Type ALP-L1-BidPrice ALP-L1-BidSize ALP-L1-AskPrice ALP-L1-AskSize ALP-L2-BidPrice ALP-L2-BidSize ALP-L2-AskPrice ALP-L2-AskSize ALP-L3-BidPrice ALP-L3-BidSize ALP-L3-AskPrice ALP-L3-AskSize ALP-L4-BidPrice ALP-L4-BidSize ALP-L4-AskPrice ALP-L4-AskSize ALP-L5-BidPrice ALP-L5-BidSize ALP-L5-AskPrice ALP-L5-AskSize TOR-L1-BidPrice TOR-L1-BidSize TOR-L1-AskPrice TOR-L1-AskSize TOR-L2-BidPrice TOR-L2-BidSize TOR-L2-AskPrice TOR-L2-AskSize TOR-L3-BidPrice TOR-L3-BidSize TOR-L3-AskPrice TOR-L3-AskSize TOR-L4-BidPrice TOR-L4-BidSize TOR-L4-AskPrice TOR-L4-AskSize TOR-L5-BidPrice TOR-L5-BidSize TOR-L5-AskPrice TOR-L5-AskSize
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 16000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 46000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 22000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 36000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 32000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 40000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 44000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 36000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:12.1 Market Depth 5.29 50000 5.3 32000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 46000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:12.1 Market Depth 5.29 50000 5.3 38000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 36000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
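On the memory concern, csv.DictReader does not load the whole file into memory; it reads one row at a time from the open file object, so only whatever you accumulate has to fit in RAM, not the 200 GB of raw data. A minimal sketch of that pattern, using the same file name as in the code below (rows_per_security is just an illustrative counter; pass delimiter='\t' to DictReader if the file is tab-separated rather than comma-separated):

import csv
from collections import defaultdict

rows_per_security = defaultdict(int)
with open('2016_01_04-data_3_stocks.csv') as f:
    for row in csv.DictReader(f):                            # yields one row at a time
        rows_per_security[row['#RIC'].split('.')[0]] += 1    # only this small dict lives in RAM
print(dict(rows_per_security))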
Here is what I am attempting to do:
- read in the data and store it in a dictionary keyed by [symbol][time][bid] and [ask], etc.
- at any point in time, find the best bid price and best ask price (this requires sorting across the values stored under each key, which I don't know how to do); since the bid and ask prices come from different exchanges, I need to find the best prices and rank them from best to worst, along with the volume at each price (see the sketch after this list)
- export to a csv file.
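For the ranking step, if the aggregated depth at one point in time is a dictionary mapping price to total size, the best bid is simply the highest bid price and the best ask the lowest ask price, and sorted() gives the full best-to-worst ranking. A minimal sketch with made-up numbers (the bids and asks dictionaries are hypothetical, not read from the file); note that the prices must be numeric, not strings, for the comparison to be meaningful:

# hypothetical aggregated depth for one security in one time bucket
bids = {5.29: 100000, 5.28: 50000, 5.27: 50000}    # price -> total size across exchanges
asks = {5.30: 16000, 5.31: 50000, 5.32: 50000}

best_bid = max(bids)                               # highest bid price
best_ask = min(asks)                               # lowest ask price

ranked_bids = sorted(bids.items(), reverse=True)   # [(5.29, 100000), (5.28, 50000), (5.27, 50000)]
ranked_asks = sorted(asks.items())                 # [(5.3, 16000), (5.31, 50000), (5.32, 50000)]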
Here is my attempt at the code. Please help me write it more efficiently:
# this file calculates the depth up to $50,000
import csv
from math import ceil
from collections import defaultdict
# open csv file
csv_file = open('2016_01_04-data_3_stocks.csv', 'rU')
reader = csv.DictReader(csv_file)
# Set variables:
date = None
exchange_depth = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(float))))
effective_spread = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(float))))
# 100,000-microsecond (0.1 s) buckets spanning 16 hours (57,600 s)
time_bucket = [i * 100000.0 for i in range(0, 57600000000 // 100000)]
# Set functions
def time_to_milli(times):
    # expects a local time like "HH:MM:SS.ffffff"; with these multipliers the
    # result is microseconds since midnight
    hours, minutes, seconds = times.split(':')
    return (float(hours) * 60 * 60 + float(minutes) * 60 + float(seconds)) * 1000000
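# Worked example (assuming the HH:MM:SS.ffffff format above):
#   time_to_milli("09:30:10.823400") = (9*3600 + 30*60 + 10.8234) * 1000000 = 34210823400.0
#   and ceil(34210823400.0 / 100000.0) * 100000.0 = 34210900000.0,
#   i.e. the 0.1-second bucket used in the loop below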
# Extract data
for i in reader:
    if date is None:
        date = i['Date[L]'][0:4] + "-" + i['Date[L]'][4:6] + "-" + i['Date[L]'][6:8]
    security = i['#RIC'].split('.')[0]
    exchange = i['#RIC'].split('.')[1]
    timestamp = time_to_milli(i['Time[L]'])
    # round up to the nearest 100,000-microsecond (0.1 s) bucket
    bucket = ceil(timestamp / 100000.0) * 100000.0
    # accumulate bid and ask depth (total size keyed by price) for both venues
    # and all five levels; DictReader yields strings, so convert to float
    # (numeric prices are also needed for the ranking later on)
    for venue in ('ALP', 'TOR'):
        for level in range(1, 6):
            bid_price = float(i['%s-L%d-BidPrice' % (venue, level)])
            bid_size = float(i['%s-L%d-BidSize' % (venue, level)])
            exchange_depth[security][bucket]['Bid'][bid_price] += bid_size
            ask_price = float(i['%s-L%d-AskPrice' % (venue, level)])
            ask_size = float(i['%s-L%d-AskSize' % (venue, level)])
            exchange_depth[security][bucket]['Ask'][ask_price] += ask_size
# Now rank the bid and ask prices among the exchange_depth[security][bucket]['Bid']
# and exchange_depth[security][bucket]['Ask'] keys
# I don't know how to do this
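For those last two comments, here is one possible way to produce the ranking and the CSV export, continuing from the exchange_depth and date variables built above; the output file name 'depth_ranked.csv' and the column layout are only an illustration, not a required format:

# rank each bucket's depth and write one row per price level
with open('depth_ranked.csv', 'w') as out_file:    # on Python 3, add newline='' to open()
    writer = csv.writer(out_file)
    writer.writerow(['Date', 'Security', 'Bucket', 'Side', 'Rank', 'Price', 'Size'])
    for security in sorted(exchange_depth):
        for bucket in sorted(exchange_depth[security]):
            sides = exchange_depth[security][bucket]
            # bids rank best-to-worst from the highest price, asks from the lowest
            for side, reverse in (('Bid', True), ('Ask', False)):
                ranked = sorted(sides[side].items(), reverse=reverse)
                for rank, (price, size) in enumerate(ranked, start=1):
                    writer.writerow([date, security, bucket, side, rank, price, size])
# close the input file opened at the top
csv_file.close()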