
I have a question about how to do this in Python. I have a very large dataset (200 GB) and I plan to use Python to iterate through the lines, store the data in a dictionary, and then perform some calculations. Finally, I will write the computed data to a CSV file. My concern is my computer's capacity: I am afraid (or pretty sure) that my RAM cannot hold a dataset that large. Is there a better way? Here is the structure of the input data:

#RIC    Date[L] Time[L] Type    ALP-L1-BidPrice ALP-L1-BidSize  ALP-L1-AskPrice ALP-L1-AskSize  ALP-L2-BidPrice ALP-L2-BidSize  ALP-L2-AskPrice ALP-L2-AskSize  ALP-L3-BidPrice ALP-L3-BidSize  ALP-L3-AskPrice ALP-L3-AskSize  ALP-L4-BidPrice ALP-L4-BidSize  ALP-L4-AskPrice ALP-L4-AskSize  ALP-L5-BidPrice ALP-L5-BidSize  ALP-L5-AskPrice ALP-L5-AskSize  TOR-L1-BidPrice TOR-L1-BidSize  TOR-L1-AskPrice TOR-L1-AskSize  TOR-L2-BidPrice TOR-L2-BidSize  TOR-L2-AskPrice TOR-L2-AskSize  TOR-L3-BidPrice TOR-L3-BidSize  TOR-L3-AskPrice TOR-L3-AskSize  TOR-L4-BidPrice TOR-L4-BidSize  TOR-L4-AskPrice TOR-L4-AskSize  TOR-L5-BidPrice TOR-L5-BidSize  TOR-L5-AskPrice TOR-L5-AskSize
HOU.ALP 20150901    30:10.8 Market Depth    5.29    50000   5.3 16000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 46000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:10.8 Market Depth    5.29    50000   5.3 22000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 36000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:10.8 Market Depth    5.29    50000   5.3 32000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 40000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:10.8 Market Depth    5.29    50000   5.3 44000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 36000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:12.1 Market Depth    5.29    50000   5.3 32000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 46000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:12.1 Market Depth    5.29    50000   5.3 38000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 36000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000

Here is what I am attempting to do:

  1. Read in the data and store it in a dictionary keyed by [symbol][time], with [bid] and [ask] sub-keys, etc.
  2. At any point in time, find the best bid price and best ask price. Since the bid and ask prices come from different exchanges, this requires sorting across the values stored under each key (which I don't know how to do): find the best prices and rank them from best to worst, along with the volume available at each price. A sketch of what I mean follows the code below.
  3. Export the result to a CSV file.

Here is my attempt at the code. Please help me make it more efficient:

# this file calculates the depth up to $50,000

import csv
from math import ceil
from collections import defaultdict

# open csv file
csv_file = open('2016_01_04-data_3_stocks.csv', 'rU')
reader = csv.DictReader(csv_file)

# Set variables:
date = None
exchange_depth = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(float))))
effective_spread = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(float))))
time_bucket = [i * 100000.0 for i in range(57600000000 // 100000)]  # all 100 ms buckets in a 16-hour day (not used yet)

# Set functions
def time_to_micros(times):
    # convert an H:MM:SS.f local timestamp to microseconds since midnight;
    # the fractional seconds are already part of the last field
    hours, minutes, seconds = times.split(':')
    return (float(hours) * 3600 + float(minutes) * 60 + float(seconds)) * 1000000

# Extract data
for i in reader:
    if date is None:
        date = i['Date[L]'][0:4] + "-" + i['Date[L]'][4:6] + "-" + i['Date[L]'][6:8]
    security, exchange = i['#RIC'].split('.')
    timestamp = time_to_micros(i['Time[L]'])
    # round up into a 100-millisecond bucket
    bucket = ceil(timestamp / 100000.0) * 100000.0
    # accumulate bid and ask sizes at each price, across both venues and all five levels
    for venue in ('ALP', 'TOR'):
        for level in range(1, 6):
            for side in ('Bid', 'Ask'):
                price = i['%s-L%d-%sPrice' % (venue, level, side)]
                size = i['%s-L%d-%sSize' % (venue, level, side)]
                if price and size:  # the feed leaves some fields blank
                    exchange_depth[security][bucket][side][float(price)] += float(size)
# Now rank bid and ask prices among the exchange_depth[security][bucket]['Bid'] and ['Ask'] keys
    # I don't know how to do this
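To illustrate step 2, here is a sketch of the ranking I have in mind, using sorted() (I have not tested this at scale): bids ranked from highest price to lowest, asks from lowest to highest, each paired with the total size accumulated at that price.

# sketch of the ranking for one (security, bucket) pair:
# best bid = highest price, best ask = lowest price
for security in exchange_depth:
    for bucket in exchange_depth[security]:
        book = exchange_depth[security][bucket]
        ranked_bids = sorted(book['Bid'].items(), reverse=True)  # (price, size), best first
        ranked_asks = sorted(book['Ask'].items())                # (price, size), best first
        if ranked_bids and ranked_asks:
            best_bid_price, best_bid_size = ranked_bids[0]
            best_ask_price, best_ask_size = ranked_asks[0]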
duckman
  • If you have 200 GB of disk space, your computer will probably just start swapping RAM out. This will technically work but will be very slow, especially if using a spinning disk as opposed to an SSD. Instead, you should probably read in the file a few MB at a time, process those lines, and then write to the output file, and rinse and repeat until you've read and processed everything. – BallpointBen Oct 31 '16 at 21:20
  • @BallpointBen thanks, can you please provide some code examples? – duckman Oct 31 '16 at 21:22
  • What @BallpointBen said about swapping is only true if you're using a 64-bit version of Python with a 64-bit OS. – martineau Oct 31 '16 at 21:25
  • @martineau what happens otherwise? MemoryError? Is this just because 32-bit systems only support 4GB of addressable memory per process? – BallpointBen Oct 31 '16 at 21:25
  • Yes, that's what I would expect, and that's why. – martineau Oct 31 '16 at 21:26

1 Answer

This is difficult to answer precisely, because how you store things matters a great deal once you get into very large datasets, as does how you define the dataset's size.

If this is a CSV file that takes 200 GB on disk, it may well be that in binary form it fits in something like 8 GB. Then again, if you use a Python object for each numerical piece of data, it could consume close to 1 TB of real RAM.
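To make the overhead concrete, here is a quick check (the sizes shown are typical of 64-bit CPython and vary by build):

import sys
import struct

x = 5.29
print(sys.getsizeof(x))          # ~24 bytes as a Python float object
print(len(struct.pack('d', x)))  # 8 bytes as a raw C double
# a list of N floats also stores N ~8-byte pointers on top of the objects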

If you have a dataset that might fit into RAM and you want to make it fit, start with something like numpy, which provides an interface over native C structures. You can use things like struct.pack to help coerce your data into C types.
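For instance, a minimal sketch of the idea (the sample values are made up; the point is the per-element storage):

import numpy as np

# one 8-byte C double per price and one 4-byte C int per size,
# instead of a ~24-byte Python object plus a list pointer for each value
prices = np.array([5.29, 5.30, 5.28, 5.31], dtype=np.float64)
sizes = np.array([50000, 16000, 50000, 50000], dtype=np.int32)

print(prices.nbytes, sizes.nbytes)  # 32 16 -- payload bytes only
print(prices.max(), sizes.sum())    # vectorized, no per-element Python objects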

If your dataset has no chance of fitting into RAM, you will want to look at other ways of analyzing it, such as databases and/or tools built for the job, like R. You can also go out and buy a server with tons of RAM.
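Another route, raised in the comments, is to process the file incrementally: a csv reader holds only one row at a time, so if each time bucket is written out as soon as it is complete, peak memory is one bucket rather than 200 GB. A simplified sketch (the function names are mine; it assumes tab-delimited input sorted by time and uses only the L1 fields for brevity):

import csv

def flush(writer, t, bids, asks):
    # write the best quotes for one completed time bucket
    if bids and asks:
        writer.writerow([t, max(bids), min(asks)])

def best_quotes(in_path, out_path):
    with open(in_path) as fin, open(out_path, 'w') as fout:
        writer = csv.writer(fout)
        writer.writerow(['time', 'best_bid', 'best_ask'])
        current_time, bids, asks = None, [], []
        for row in csv.DictReader(fin, delimiter='\t'):
            if current_time is not None and row['Time[L]'] != current_time:
                flush(writer, current_time, bids, asks)  # bucket done: write and forget
                bids, asks = [], []
            current_time = row['Time[L]']
            for venue in ('ALP', 'TOR'):
                bid = row['%s-L1-BidPrice' % venue]
                ask = row['%s-L1-AskPrice' % venue]
                if bid:
                    bids.append(float(bid))
                if ask:
                    asks.append(float(ask))
        flush(writer, current_time, bids, asks)  # last bucket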

Clarus
  • For other ways of processing the data, you forgot to mention incrementally (as well as making multiple passes over it). – martineau Oct 31 '16 at 21:30