
I'm facing a memory error in my code. My parser can be summarized like this:

#! /usr/bin/env python
# coding=utf-8
import sys
import json
from collections import defaultdict


class MyParserIter(object):

    def _parse_line(self, line):
        for couple in line.split(","):
            key, value = couple.split(':')[0], couple.split(':')[1]
            self.__hash[key].append(value)

    def __init__(self, line):
        # not the real parsing just a example to parse each
        # line to a dict-like obj
        self.__hash = defaultdict(list)
        self._parse_line(line)

    def __iter__(self):
        return iter(self.__hash.values())

    def to_dict(self):
        return self.__hash

    def __getitem__(self, item):
        return self.__hash[item]

    def free(self, item):
        self.__hash[item] = None

    def free_all(self):
        for k in self.__hash:
            self.free(k)

    def to_json(self):
        return json.dumps(self.to_dict())


def parse_file(file_path):
    list_result = []
    with open(file_path) as fin:
        for line in fin:
            parsed_line_obj = MyParserIter(line)
            list_result.append(parsed_line_obj)
    return list_result


def write_to_file(list_obj):
    with open("out.out", "w") as fout:
        for obj in list_obj:
            json_out = obj.to_json()
            fout.write(json_out + "\n")
            obj.free_all()
            obj = None

if __name__ == '__main__':
    result_list = parse_file('test.in')
    print(sys.getsizeof(result_list))
    write_to_file(result_list)
    print(sys.getsizeof(result_list))
    # the same result for memory usage of result_list
    print(sys.getsizeof([None] * len(result_list)))
    # the result is not the same :(

The aim is to parse a (large) file, transforming each line into a JSON object that will be written back to a file.

My goal is to reduce the memory footprint, because in some cases this code raises a MemoryError. After each fout.write I would like to delete (free the memory of) the obj reference.

I tried setting obj to None and calling the method obj.free_all(), but neither of them frees the memory. I also used simplejson rather than json, which reduced the footprint, but it is still too large in some cases.

test.in looks like:

test1:OK,test3:OK,...
test1:OK,test3:OK,...
test1:OK,test3:OK,test4:test_again...
....
Ali SAID OMAR
  • Did you try gc.collect() already? See: http://stackoverflow.com/questions/1316767/how-can-i-explicitly-free-memory-in-python – JonnyTieM Feb 17 '16 at 10:24
  • how big is your test.in? – YOU Feb 17 '16 at 10:31
  • For the real parser the input file is around 300Mb. – Ali SAID OMAR Feb 17 '16 at 10:33
  • @JonnyTieM, yes I did, but with a terribly negative impact on performance. – Ali SAID OMAR Feb 17 '16 at 10:36
  • have you tried _unreferencing_ these objects? See it [here](http://stackoverflow.com/questions/15455048/releasing-memory-in-python) – bipartite Feb 17 '16 at 10:39
  • It's not that big, but your code could be implemented without using any class, which would use less memory. – YOU Feb 17 '16 at 10:40
  • I just summarized my parser; in fact my application is an https://en.wikipedia.org/wiki/Electronic_data_interchange file parser with a plugin pattern (a plugin per norm), and the parsing rules depend on the norm version (in my code, different parser classes are loaded). – Ali SAID OMAR Feb 17 '16 at 10:48
  • As far as I understand your code, one regex sub could do the work against the whole file. – YOU Feb 17 '16 at 10:52
  • Yes, but that would make the code more complicated. Each segment (a piece of data, which could be a line) contains a list of N elements (~20), each element being a key/value pair; in some standards an element can have a list of M (~5) sub-elements (key/value). A loader class loads the appropriate definition class to be able to extract the structure. Finally, each file contains messages, 300 for the bigger files, and each message contains K (~20) segments. – Ali SAID OMAR Feb 17 '16 at 11:18
  • @bipartite, not yet but I'm afraid for the performance. – Ali SAID OMAR Feb 17 '16 at 11:18
  • @Ali try it first while monitoring its performance. Otherwise, you will have the same problem recurring. – bipartite Feb 17 '16 at 11:32
  • @bipartite, the use of `gc.collect` inside the loop does reduce the footprint, but it also really reduces the performance. – Ali SAID OMAR Feb 17 '16 at 13:47
  • @Ali I believe so, but what I'm referring to is _unreferencing_ the objects that you have. I, as well, am not using gc.collect that much. Would you mind trying del instead of gc.collect? – bipartite Feb 18 '16 at 00:41
  • It's not possible to del an item during the loop; it would cause an error. :( – Ali SAID OMAR Feb 18 '16 at 08:59
  • `key, value = couple.split(':')[0], couple.split(':')[1]` this is just: `key, value = couple.split(':')` (assuming `couple.count(':') == 1`). – Bakuriu Feb 24 '16 at 16:54
  • Just an example of parsing (not my real parser) – Ali SAID OMAR Feb 24 '16 at 16:58
  • But wait. Can't you simply change `parse_file` to be a generator and `yield` the single elements instead of storing the whole list? To me the problem seems to be that the list itself is too big. But then the only solution is to avoid storing *the list* `result_list`, not removing the references when converting to JSON (that's already done by the garbage collector). – Bakuriu Feb 24 '16 at 17:04
  • In fact result_list must be JSON serializable. – Ali SAID OMAR Feb 24 '16 at 17:11
  • @Ali: No. As you said in your question: The aim is to parse (large) file, each line transformed to a json object that will be written back to a file. I don't read "it must be stored completely in memory". – hvb Mar 02 '16 at 07:47
  • @hvb, you are right. I will change the parser, but the list was useful because it lets the user get the result for line j. – Ali SAID OMAR Mar 02 '16 at 08:29

2 Answers


Don't store many class instances in an array; instead, do the work inline. Example:

% cat test.in
test1:OK,test3:OK
test1:OK,test3:OK
test1:OK,test3:OK,test4:test_again

% cat test.py 
import json

with open("test.in", "rb") as src:
    with open("out.out", "wb") as dst:
        for line in src:
            pairs, obj = [x.split(":",1) for x in line.rstrip().split(",")], {}
            for k,v in pairs:
                if k not in obj: obj[k] = []
                obj[k].append(v)
            dst.write(json.dumps(obj)+"\n")

% cat out.out
{"test1": ["OK"], "test3": ["OK"]}
{"test1": ["OK"], "test3": ["OK"]}
{"test1": ["OK"], "test3": ["OK"], "test4": ["test_again"]}

If it is slow, don't write to the file line by line; instead, store the dumped JSON strings in an array and do dst.write("\n".join(array)).
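
For example, a minimal sketch of that batched variant, using the same parsing as above (the chunk size of 1000 and the buffering scheme are arbitrary choices, not something from the original answer):

import json

CHUNK = 1000  # arbitrary batch size

with open("test.in") as src, open("out.out", "w") as dst:
    buf = []
    for line in src:
        obj = {}
        for k, v in (x.split(":", 1) for x in line.rstrip().split(",")):
            obj.setdefault(k, []).append(v)
        buf.append(json.dumps(obj))
        if len(buf) >= CHUNK:
            # flush the whole batch with a single write call
            dst.write("\n".join(buf) + "\n")
            buf = []
    if buf:
        # write whatever is left in the final, partial batch
        dst.write("\n".join(buf) + "\n")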

YOU
  • Thanks, but it would imply refactoring my class and mixing the logic (writing the output and parsing the input). The parser should be able to output either to a file or to the console; that's why I need to store the parsed result, to be able to iterate over it or get a key. Lastly, this approach does not suit me for the unit tests. – Ali SAID OMAR Feb 17 '16 at 13:57
  • @AliSAIDOMAR That's easy. Just write a generator that `yield`s the value (basically put the code from this answer into a `def` and replace `dst.write` with `yield`; see the sketch after these comments). Then you can just iterate over the result and write it to whatever you want. – Bakuriu Feb 24 '16 at 16:59
  • The main point here is to not store the whole result (in whichever form), but to read/parse/write line by line. The way your original program works, the memory consumption is O(file_length), whereas with YOU's approach (read/parse/write line by line) the memory consumption is O(max_line_length). – hvb Mar 02 '16 at 07:43
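
A minimal sketch of the generator refactor Bakuriu and hvb describe, reusing the MyParserIter class from the question (an assumed reshaping of the asker's code, not code from the original answer):

def parse_file(file_path):
    # Generator version of the question's parse_file: yields one
    # MyParserIter per line instead of keeping them all in a list.
    with open(file_path) as fin:
        for line in fin:
            yield MyParserIter(line)


def write_to_file(parsed_objs):
    with open("out.out", "w") as fout:
        for obj in parsed_objs:  # only one parsed line is alive at a time
            fout.write(obj.to_json() + "\n")


# Same call as before, but memory stays proportional to one line:
write_to_file(parse_file("test.in"))

This keeps the parsing and the output logic separate, so the caller can still write to a file, print to the console, or collect the objects for tests.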

In order for obj to be free-able, all references to it have to be eliminated. Your loop didn't do that because the reference in list_obj still existed. The following will fix that:

def write_to_file(list_obj):
    with open("out.out", "w") as fout:
        for ix in range(len(list_obj)):
            obj = list_obj[ix]
            list_obj[ix] = None
            json_out = obj.to_json()
            fout.write(json_out + "\n")
            obj.free_all()

Alternatively, you could destructively pop elements from the front of list_obj, although that could cause performance problems, since each pop(0) has to shift all of the remaining elements. That version looks like this (a deque-based alternative is sketched after the code):

def write_to_file(list_obj):
    with open("out.out", "w") as fout:
        while len(list_obj) > 0:
            obj = list_obj.pop(0)
            json_out = obj.to_json()
            fout.write(json_out + "\n")
            obj.free_all()
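
If the repeated pop(0) does turn out to be slow, a collections.deque pops from the left in O(1); a sketch along the same lines, not part of the original answer:

from collections import deque


def write_to_file(list_obj):
    queue = deque(list_obj)  # copies only the references
    del list_obj[:]          # drop the caller's references as well
    with open("out.out", "w") as fout:
        while queue:
            obj = queue.popleft()  # O(1), unlike list.pop(0)
            fout.write(obj.to_json() + "\n")
            obj.free_all()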
Tom Karzes