
I have seen a similar question to this, but I think my predicament is different in enough ways to warrant a new question.

I have created a function which opens a CSV file and aggregates the data into a JSON-like dictionary structure based on a list of dimensions and metrics.

The problem is that when I use it to open a file which is 0.97GB and look in my processes, the Python process is using about 1.02GB of memory. Bearing in mind that I am selecting only a fraction of the fields in the file, and that the data is aggregated, I would think it should by nature be smaller. Also, the dictionary variable is the only thing which gets returned from the function, so shouldn't that mean it's the only thing remaining in memory after the function has run? Does anyone know why my dictionary object is using so much memory?

EDIT - Also, it's my understanding that csv.reader() reads the file lazily (it's an iterator), so I'm not even loading the whole file at once; it must just be the dictionary object using all of the memory?

I'm using Python 2.7 on Windows.

import json
import inspect
from pprint import pprint
import csv
from datetime import datetime
import sys


def jsonify_csv(fileString, dimensions, metrics, struc = {}):
    with open(fileString, 'rb') as f:
        reader=csv.reader(f)
        headings = reader.next()
        i = 0
        for line in reader:
            i+=1
            row =  {headings[i]:v for i, v in enumerate(line)}
            pointer = struc
            for dimension in dimensions:
                if dimension == 'date':
                    val = str(datetime.strptime(row[dimension], "%d/%m/%Y").date().month)
                else:
                    val = str(row[dimension])
                pointer.setdefault(val, {})
                pointer = pointer[val]
            for metric in metrics:
                pointer.setdefault(metric, 0.0)
                try:
                    pointer[metric] += float(row[metric])
                except ValueError:
                    pass
    return struc


start = datetime.today()

dims = ['brand', 'source', 'affiliate', 'country', 'store', 'salesbundle', 'product', 'ordertype', 'returncode', 'supplier', 'category']

metrics = ['sales', 'qty', 'cogs', 'carriagereclaim', 'Carriage Charged Carrier', 'carriage_est', 'mktg_est', 'mktg_cost', 'royalty', 'finance', 'scrap_cost', 'mp_cost', 'budgetsales', 'budgetcosts', 'BSTD', 'budgetaftersales', 'budgetscrap', 'budgetcarriagerecovery', 'budgetcarriagepaid', 'budgetmetapack', 'budgetmarketing', 'budgetaffiliate', 'budgetoffline', 'budgetroyalty', 'budgetfinance', 'bundle_qty', 'misc_adjustments']

jsonified = jsonify_csv('PhocasSales_2015+.csv', dims, metrics)

print 'file opened', datetime.today()-start

stop = raw_input("waiting...")
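
A rough way to check how much of that 1.02GB is the returned dictionary itself, as opposed to the process as a whole, is a recursive sys.getsizeof walk. The deep_getsizeof helper below is only an illustrative sketch, not part of the original script, and the per-object sizes it sums are CPython-specific:

import sys

def deep_getsizeof(obj, seen=None):
    # Illustrative sketch: approximate the total size of a nested structure
    # by summing sys.getsizeof over its containers, keys and values.
    # The figures are CPython-specific and ignore string interning,
    # so treat the result as a rough estimate only.
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for k, v in obj.iteritems():
            size += deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += deep_getsizeof(item, seen)
    return size

print 'approx. size of jsonified: %.2f MB' % (deep_getsizeof(jsonified) / 1024.0 / 1024.0)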
teebagz
  • Don't use a mutable object as a default parameter. See http://docs.python-guide.org/en/latest/writing/gotchas/ – cdarke Jul 27 '16 at 15:07
  • Hi @cdarke, thanks for your answer; could you elaborate on why? The reason I have included struc = {} is in case I wanted to, for example, open 5 separate files and have them all stored under separate branches of the same object, e.g. x = {file1: {}, file2: {}} – teebagz Jul 27 '16 at 15:09
  • Each call will use the same dictionary. Did you read the link I gave? The empty dictionary `{}` is created as an attribute of the function at compilation time. If you called the function 28 times using the default you will not get 28 different dictionaries, they will all share the same one. Default it to `None` then test its value in the body of the function. – cdarke Jul 27 '16 at 15:12
  • @cdarke Sorry, I didn't see the link you provided, thank you for this. I'll change it from a default parameter and pass {} when I call the function instead – teebagz Jul 27 '16 at 15:14
  • Or default it to None! Thanks – teebagz Jul 27 '16 at 15:15
  • See my post as an example of how to handle it. You are not the first to get caught on this! – cdarke Jul 27 '16 at 15:16

1 Answer


Each call will use the same dictionary. See http://docs.python-guide.org/en/latest/writing/gotchas/. The empty dictionary {} is created once, when the function is defined, not on each call.

If you called the function 28 times using the default you will not get 28 different dictionaries, they will all share the same one. Default it to None then test its value in the body of the function.

Try this:

def jsonify_csv(fileString, dimensions, metrics, struc = None):
    if struc is None:
        struc = {}

    with open(fileString, 'rb') as f:
        ... # and so on
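
To see why the shared default matters, here is a minimal sketch of the gotcha itself, separate from the CSV code:

def append_to(value, bucket={}):          # mutable default: one dict shared by every call
    bucket.setdefault('values', []).append(value)
    return bucket

print append_to(1)   # {'values': [1]}
print append_to(2)   # {'values': [1, 2]}  <- same dict as the first call

def append_to_fixed(value, bucket=None):  # None default: fresh dict per call
    if bucket is None:
        bucket = {}
    bucket.setdefault('values', []).append(value)
    return bucket

print append_to_fixed(1)   # {'values': [1]}
print append_to_fixed(2)   # {'values': [2]}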
cdarke
  • I've fixed this now but still not sure why a single call of the function is giving me an object which uses so much memory. – teebagz Jul 27 '16 at 15:19
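
A likely part of the answer to that last comment is CPython's per-object overhead: every small string and float stored in the nested dictionary costs far more than its raw data. A quick check makes this visible (a minimal sketch; exact sizes vary by build and platform):

import sys

# Rough per-object costs in CPython 2.7 (exact numbers vary by build/platform).
print sys.getsizeof(0.0)                          # a float is ~24 bytes, not 8
print sys.getsizeof('carriagereclaim')            # a str carries ~37 bytes of overhead plus its characters
print sys.getsizeof({})                           # even an empty dict is a few hundred bytes
print sys.getsizeof(dict.fromkeys(metrics, 0.0))  # each leaf dict of metric totals has its own hash table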