
I'm reading a CSV file which is about 1 GB on disk, but it winds up taking over 10 GB of memory. Since DictReader returns an iterator of dicts, each keyed by the header fields, I can imagine rows taking up twice as much space (~1 GB extra), but ten times as much? This confuses me.

import csv

def readeverything(filename):
    thefile = open(filename)
    reader = csv.DictReader(thefile, delimiter='\t')
    lines = []

    for datum in reader:
        lines.append(datum)
    thefile.close()

    return lines

The size of the raw string is actually smaller than the size of the parsed dict. I found this out using sys.getsizeof on the first line in the file and on the first record read by csv.DictReader. Therefore, the size of the dictionary alone does not account for the tenfold blow-up in memory usage when reading the CSV.
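One caveat with this measurement: `sys.getsizeof` is shallow, so comparing it across a `str` and a `dict` can mislead. A minimal sketch (with made-up row data) contrasting the shallow size of a dict with a simplified "deep" size that also counts its keys and values:

```python
import sys

# Hypothetical row: one raw line and its parsed dict form.
line = "Alice\tSpringfield\t91.5"
row = {"name": "Alice", "city": "Springfield", "score": "91.5"}

# Shallow size: only the dict's hash table of references,
# not the key/value strings it points to.
shallow = sys.getsizeof(row)

# Simplified deep size: add the sizes of the keys and values too
# (string keys/values here, so no further recursion is needed).
deep = shallow + sum(sys.getsizeof(k) + sys.getsizeof(v)
                     for k, v in row.items())

print(shallow, deep, sys.getsizeof(line))
```

The deep size is what each stored row actually costs, and it is substantially larger than the raw line.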

gideonite
  • Python objects carry quite a lot of metadata, Python being a dynamic language. Don't keep all the data in a list at once as you do now. – ivan_pozdeev Mar 03 '16 at 21:19
  • I want to sort the lines in the file and then apply other functions based on the sorted list. – gideonite Mar 03 '16 at 21:20
  • @ivan_pozdeev just addressed the claim that this is a duplicate question. If you'd like, I can back it up with some numbers but this is probably meaningless without the original data. – gideonite Mar 03 '16 at 22:24
  • @gideonite: `dict`s have a lot of overhead that `list` and `tuple` do not. If you want to reduce memory overhead dramatically, you can read the header to get the list of column names, and use it to create a `collections.namedtuple` type with appropriate fields. You can then use a regular `csv.reader` to read, wrap the reader with `map(mynamedtuple._make, myreader)` to convert to `namedtuple`s with attributes matching the columns. If necessary, you can call `._asdict()` on them when you actually use them to (temporarily) convert back to an (ordered) `dict`. – ShadowRanger Mar 03 '16 at 22:44
  • Fair enough. I'll try one of those then. If I'm not mistaken, the dict used less memory than the original string as reported by `sys.getsizeof`. – gideonite Mar 03 '16 at 22:48
  • @gideonite: `sys.getsizeof` doesn't recurse, so it's only telling you the overhead of the `dict` structure storing the references to keys and values, not the size occupied by the keys and values themselves. Similarly, most user-defined objects won't include the cost of their `__dict__`, which is where all the actual attributes are stored, making them seem much smaller than they really are. – ShadowRanger Mar 03 '16 at 23:27
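The `namedtuple` approach suggested above can be sketched as follows. This is a hypothetical helper (the function name and file layout are assumptions, and it assumes the header fields are valid Python identifiers); namedtuples store fields in a compact array rather than a per-row hash table, which is where the memory saving comes from:

```python
import csv
from collections import namedtuple

def read_rows(filename):
    # Read a tab-delimited file into namedtuples instead of dicts.
    with open(filename, newline='') as f:
        reader = csv.reader(f, delimiter='\t')
        # First row is the header; its fields become attribute names.
        Row = namedtuple('Row', next(reader))
        # map/_make avoids building an intermediate dict per row.
        return [Row._make(fields) for fields in reader]
```

Fields are then accessed as `row.colname`, and `row._asdict()` converts a single row back to a dict on demand where the dict interface is genuinely needed.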
