I have seen a similar question to this, but I think my predicament is different enough to warrant a new question.
I have created a function which opens a CSV file and aggregates the data into a JSON-like dictionary structure based on a list of dimensions and metrics.
The problem is that when I use it to open a 0.97 GB file, the Python process uses about 1.02 GB of memory according to my process list. Bear in mind that I am selecting only a fraction of the fields in the file and the data is aggregated, so I would expect the result to be smaller than the file. Also, the dictionary is the only thing returned from the function, so shouldn't it be the only thing remaining in memory after the function has run? Does anyone know why my dictionary object is using so much memory?
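To put a number on this, one rough check would be to add up `sys.getsizeof` over the whole nested structure and compare it with the size of the numbers alone. This is only a sketch: `deep_getsizeof` and the `sample` data are made up for illustration, it only descends into dicts, and `sys.getsizeof` reports the size of each object itself rather than anything it refers to. It is written to run on both Python 2.7 and 3.

```python
import sys


def deep_getsizeof(obj, seen=None):
    """Roughly sum sys.getsizeof over nested dicts, counting each object once."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += deep_getsizeof(key, seen)
            size += deep_getsizeof(value, seen)
    return size


# Tiny made-up structure shaped like the output of jsonify_csv.
sample = {'brandA': {'web': {'sales': 10.0, 'qty': 2.0}},
          'brandB': {'web': {'sales': 5.0, 'qty': 1.0}}}

# Size of the four float objects on their own, without any dict overhead.
flat_bytes = sum(sys.getsizeof(v) for v in (10.0, 2.0, 5.0, 1.0))
total_bytes = deep_getsizeof(sample)
print('%d bytes in total, %d bytes of floats' % (total_bytes, flat_bytes))
```

The exact figures depend on the Python version and platform, but the total should come out well above the size of the floats, since every nested dict and key string adds its own overhead.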
**EDIT:** It's also my understanding that csv.reader() returns an iterator that reads one row at a time (much like a generator), so I'm not loading the whole file at once. So it must just be the dictionary object using all of the memory?
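A small way to check that assumption about lazy reading, as a sketch only: feed `csv.reader` a generator that records each line as it is pulled, then look at how many lines have been consumed after each row. The `lines` generator, the `consumed` list and the sample rows are made up for illustration, and the code is written to run on both Python 2.7 and 3.

```python
import csv

consumed = []


def lines():
    """Yield CSV lines one at a time, recording each one as it is pulled."""
    for text in ['brand,sales', 'a,1.5', 'b,2.5', 'a,3.0']:
        consumed.append(text)
        yield text


reader = csv.reader(lines())

headings = next(reader)            # next() works on Python 2.6+ and 3
pulled_after_header = len(consumed)

first_row = next(reader)
pulled_after_first_row = len(consumed)

# If the reader is lazy, only the lines needed so far have been pulled.
print('%d %d' % (pulled_after_header, pulled_after_first_row))
```

If the counts grow by one per row rather than jumping straight to four, the reader is pulling lines on demand, which would point back at the dictionary as the thing holding the memory.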
I'm using Python 2.7 on Windows.
import json
import inspect
from pprint import pprint
import csv
from datetime import datetime
import sys
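
def jsonify_csv(fileString, dimensions, metrics, struc=None):
    # Default to None rather than {}: a mutable default is shared between calls.
    if struc is None:
        struc = {}
    with open(fileString, 'rb') as f:
        reader = csv.reader(f)
        headings = reader.next()
        for line in reader:
            # Map each column heading to its value for this row.
            row = {headings[i]: v for i, v in enumerate(line)}
            # Walk down one nested dict per dimension, creating levels as needed.
            pointer = struc
            for dimension in dimensions:
                if dimension == 'date':
                    val = str(datetime.strptime(row[dimension], "%d/%m/%Y").date().month)
                else:
                    val = str(row[dimension])
                pointer.setdefault(val, {})
                pointer = pointer[val]
            # Accumulate each metric at the leaf; non-numeric values are skipped.
            for metric in metrics:
                pointer.setdefault(metric, 0.0)
                try:
                    pointer[metric] += float(row[metric])
                except ValueError:
                    pass
    return struc
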
start = datetime.today()
dims = ['brand', 'source', 'affiliate', 'country', 'store', 'salesbundle', 'product', 'ordertype', 'returncode', 'supplier', 'category']
metrics = ['sales', 'qty', 'cogs', 'carriagereclaim', 'Carriage Charged Carrier', 'carriage_est', 'mktg_est', 'mktg_cost', 'royalty', 'finance', 'scrap_cost', 'mp_cost', 'budgetsales', 'budgetcosts', 'BSTD', 'budgetaftersales', 'budgetscrap', 'budgetcarriagerecovery', 'budgetcarriagepaid', 'budgetmetapack', 'budgetmarketing', 'budgetaffiliate', 'budgetoffline', 'budgetroyalty', 'budgetfinance', 'bundle_qty', 'misc_adjustments']
jsonified = jsonify_csv('PhocasSales_2015+.csv', dims, metrics)
print 'file opened', datetime.today()-start
stop = raw_input("waiting...")
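
One thing I could check is how many leaf dictionaries the aggregation actually creates, since with 11 dimensions the number of distinct combinations might not be much smaller than the number of rows. The sketch below is not my real code: `aggregate` is a cut-down stand-in for `jsonify_csv` that reads from any iterable of lines, `count_leaves` is a hypothetical helper, and the sample rows are invented. It is written to run on both Python 2.7 and 3.

```python
import csv


def aggregate(lines, dimensions, metrics):
    """Same nesting idea as jsonify_csv, reading from any iterable of lines."""
    struc = {}
    reader = csv.reader(lines)
    headings = next(reader)
    for line in reader:
        row = dict(zip(headings, line))
        pointer = struc
        for dimension in dimensions:
            # setdefault returns the existing or newly created child dict.
            pointer = pointer.setdefault(row[dimension], {})
        for metric in metrics:
            pointer.setdefault(metric, 0.0)
            try:
                pointer[metric] += float(row[metric])
            except ValueError:
                pass
    return struc


def count_leaves(node, depth):
    """Count the metric dicts sitting `depth` levels below `node`."""
    if depth == 0:
        return 1
    return sum(count_leaves(child, depth - 1) for child in node.values())


# Invented sample: five data rows, four distinct brand/source/product combinations.
sample = [
    'brand,source,product,sales',
    'a,web,p1,1.0',
    'a,web,p2,2.0',
    'a,shop,p1,3.0',
    'b,web,p1,4.0',
    'a,web,p1,5.0',
]

coarse = aggregate(sample, ['brand'], ['sales'])
fine = aggregate(sample, ['brand', 'source', 'product'], ['sales'])
print('%d leaves with 1 dimension, %d leaves with 3 dimensions'
      % (count_leaves(coarse, 1), count_leaves(fine, 3)))
```

On the real file I would call `count_leaves(jsonified, len(dims))` and compare the result with the row count. If the two are close, the data is barely being aggregated at all, and each leaf carries a dict holding up to 27 metric entries.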